# Exploratory Data Analysis


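The objective of this notebook is to explore a dataset: inspect its structure, describe and visualize its variables, standardize them, and detect outliers.
You will need pandas; a tutorial is available [here](https://pandas.pydata.org/docs/user_guide/10min.html).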


Please note that this notebook is inspired by notebooks published by Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.

## Dataset
We consider a dataset gathering information about elections and votes between 2000 and 2016 in the USA. It also includes economic indicators.

In [None]:
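import requests

# Download the helper script that builds the dataset
url = 'https://raw.githubusercontent.com/linogaliana/python-datascientist/master/content/modelisation/get_data.py'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop here if the download failed
with open('getdata.py', 'wb') as f:
    f.write(r.content)

# Import the downloaded module and build the votes dataset
import getdata
votes = getdata.create_votes_dataframes()
votes.head(3)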

Q1. What is the size of the dataframe (number of rows and columns)?

Print the column names and their types.

Print the statistics of each numerical column (mean, std, quartile, min, max).
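
Hint: the three inspections above map onto standard pandas attributes and methods. The sketch below runs them on a small made-up DataFrame (`toy` and its column names are hypothetical, not part of `votes`), so you still have to apply them to the real data.

```python
import pandas as pd

# Toy data for illustration only; the column names are made up.
toy = pd.DataFrame({
    "county": ["A", "B", "C", "D"],
    "votes": [120, 340, 90, 410],
    "rate": [3.5, 4.1, 2.8, 5.0],
})

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
print(n_rows, n_cols)

print(toy.dtypes)            # one dtype per column

summary = toy.describe()     # count, mean, std, min, quartiles, max
print(summary)               # numerical columns only by default
```

By default `describe()` skips non-numerical columns such as `county` here.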

In [None]:
#[STUDENT]

Q2. What are the different values of the 'winner' variable? Recode these values as numbers 1, 2, 3, ... and store this encoding in a new variable "winner2".
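
Hint: one possible way to recode labels as integers is `pd.factorize`. This is only a sketch on a made-up column (`label` and `label2` are hypothetical names); other approaches, such as a dictionary passed to `map`, work as well.

```python
import pandas as pd

# Toy labels for illustration; the real values come from votes["winner"].
toy = pd.DataFrame({"label": ["b", "a", "b", "c", "a"]})

print(toy["label"].unique())   # distinct values, in order of appearance

# factorize returns integer codes starting at 0, so add 1 to start at 1
codes, categories = pd.factorize(toy["label"])
toy["label2"] = codes + 1

print(list(categories))        # categories[i] is encoded as i + 1
print(toy)
```

The codes follow the order in which the values first appear, not alphabetical order.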

In [None]:
#[STUDENT]

## Descriptive analysis


Q3. Create a dataframe, named df2 in the questions below, containing only the following variables: "winner", "votes_gop",
          'Unemployment_rate_2019', 'Median_Household_Income_2019',
          'Percent of adults with less than a high school diploma, 2015-19',
          "Percent of adults with a bachelor's degree or higher, 2015-19".
Keep "GEOID" as the index of your dataframe (use *set_index*).
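
Hint: selecting columns while keeping an identifier as the index can be done by chaining `set_index` with a list of column names. The sketch below uses a made-up DataFrame (`toy`, `id`, `x`, `y`, `z` are hypothetical names).

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "id": ["01", "02", "03"],
    "x": [1.0, 2.0, 3.0],
    "y": [10, 20, 30],
    "z": ["u", "v", "w"],
})

keep = ["x", "z"]                    # columns to retain
subset = toy.set_index("id")[keep]   # "id" becomes the index, then select columns
print(subset)
```

Setting the index first means the identifier is preserved even though it is not in the list of selected columns.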


In [None]:
#[STUDENT]

Q4. Create a frequency table of the values of 'winner'. Plot these frequencies as a horizontal bar chart.
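
Hint: a frequency table for a categorical column can be obtained with `value_counts`, and pandas can plot it directly. The sketch below uses a made-up Series (`toy` is hypothetical) and assumes matplotlib is installed, since pandas plotting relies on it.

```python
import pandas as pd

# Toy categorical data for illustration only.
toy = pd.Series(["x", "y", "x", "x", "y", "z"], name="label")

freq = toy.value_counts()                  # absolute frequencies, largest first
share = toy.value_counts(normalize=True)   # relative frequencies (sum to 1)
print(freq)
print(share)

ax = freq.plot.barh()                      # one horizontal bar per category
ax.set_xlabel("count")
```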

In [None]:
#[STUDENT]

Q5. Let's consider the 'Median_Household_Income_2019' variable. Transform it into a categorical variable with five labels. Create the frequency table and the associated plot.
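
Hint: the question does not say how to choose the five classes. The sketch below assumes quantile-based classes with `pd.qcut` (`pd.cut` would give equal-width classes instead); the Series `income` and the label names are made up for illustration.

```python
import pandas as pd

# Toy numerical data for illustration only.
income = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], name="income")
labels = ["very low", "low", "medium", "high", "very high"]  # hypothetical labels

# qcut builds bins from quantiles, so each class holds roughly the same
# number of observations
income_cat = pd.qcut(income, q=5, labels=labels)

freq = income_cat.value_counts(sort=False)   # keep the classes in label order
print(freq)
```

With quantile-based classes the frequency table is nearly flat by construction, which is worth keeping in mind when you interpret the plot.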

In [None]:
#[STUDENT]

Q6. Provide descriptive statistics of all variables in the dataframe.

In [None]:
#[STUDENT]

Q7. Build a histogram of the variable 'votes_gop'.

In [None]:
#[STUDENT]

Q8. Compute the correlation matrix. Plot it using the seaborn package and its heatmap function.
Plot a scatter-plot matrix of the df2 variables with pd.plotting.scatter_matrix.
Interpret the results.
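
Hint: the correlation matrix itself comes from pandas. The sketch below computes it on a made-up DataFrame (`toy` and its columns are hypothetical) after dropping non-numerical columns; the resulting DataFrame is the kind of object you can then hand to a heatmap function.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with a
    "c": [5, 4, 3, 2, 1],    # perfectly anti-correlated with a
    "label": list("uvwxy"),  # non-numerical, must be excluded
})

# Pearson correlation between every pair of numerical columns
corr = toy.select_dtypes("number").corr()
print(corr.round(2))
```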

In [None]:
#[STUDENT]

In [None]:
#[STUDENT]

## Visualization with maps


Q9. Below, two blocks of code generate two different maps (the first one is a choropleth map). They use the same dataset but draw different shapes. Comment on these maps and on their differences.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# republican : red, democrat : blue
color_dict = {'republican': '#FF0000', 'democrats': '#0000FF'}

fig, ax = plt.subplots(figsize = (12,12))
grouped = votes.groupby('winner')
for key, group in grouped:
    group.plot(ax=ax, column='winner', label=key, color=color_dict[key])
plt.axis('off')

In [None]:
import plotly
import plotly.graph_objects as go
import pandas as pd
import geopandas as gpd
import numpy as np


centroids = votes.copy()
centroids.geometry = centroids.centroid
centroids['size'] = centroids['CENSUS_2010_POP'] / 10000  # to get a reasonable, plottable number

color_dict = {"republican": '#FF0000', 'democrats': '#0000FF'}
centroids["winner"] =  np.where(centroids['votes_gop'] > centroids['votes_dem'], 'republican', 'democrats')


centroids['lon'] = centroids['geometry'].x
centroids['lat'] = centroids['geometry'].y
centroids = pd.DataFrame(centroids[["county_name",'lon','lat','winner', 'CENSUS_2010_POP',"state_name"]])
groups = centroids.groupby('winner')

df = centroids.copy()

df['color'] = df['winner'].replace(color_dict)
df['size'] = df['CENSUS_2010_POP']/6000
df['text'] = df['CENSUS_2010_POP'].astype(int).apply(lambda x: '<br>Population: {:,} people'.format(x))
df['hover'] = df['county_name'].astype(str) +  df['state_name'].apply(lambda x: ' ({}) '.format(x)) + df['text']

fig_plotly = go.Figure(
  data=go.Scattergeo(
  locationmode = 'USA-states',
  lon=df["lon"], lat=df["lat"],
  text = df["hover"],
  mode = 'markers',
  marker_color = df["color"],
  marker_size = df['size'],
  hoverinfo="text"
  )
)

fig_plotly.update_traces(
  marker = {'opacity': 0.5, 'line_color': 'rgb(40,40,40)', 'line_width': 0.5, 'sizemode': 'area'}
)

fig_plotly.update_layout(
  title_text = "Reproduction of the \"Acres don't vote, people do\" map <br>(Click legend to toggle traces)",
  showlegend = True,
  geo = {"scope": 'usa', "landcolor": 'rgb(217, 217, 217)'}
)

## Normalization

Q10. Standardize all numerical variables in the dataframe (do not overwrite the values!) and compare the histogram of the 'Median_Household_Income_2019' variable before and after standardization.

In [None]:
#[STUDENT]

In [None]:
#[STUDENT]

In [None]:
#[STUDENT]

Q11. Verify that the distribution is centered at zero and that the empirical variance is indeed equal to 1.
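
Hint: standardizing means subtracting the mean and dividing by the standard deviation, column by column. The sketch below does this by hand on a made-up DataFrame (`toy` and `toy_std` are hypothetical names) and stores the result in a new object, so the original values are untouched.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 10.0, 30.0, 30.0]})

# ddof=0 is the population standard deviation; with it, the standardized
# columns have an empirical variance of exactly 1
toy_std = (toy - toy.mean()) / toy.std(ddof=0)

print(toy_std.mean())        # close to 0 for every column
print(toy_std.var(ddof=0))   # close to 1 for every column
```

Note that pandas uses `ddof=1` by default in `std` and `var`, whereas scikit-learn's `StandardScaler` divides by the `ddof=0` standard deviation, so the variance you measure depends on which convention you check with.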

In [None]:
#[STUDENT]

Q12. Create scaler, a transformer fitted on the first 1000 rows of your df2 DataFrame, excluding the target variable winner. Check the mean and standard deviation of each column on these same observations. The parameters that will be used for later standardization are stored in the .mean_ and .scale_ attributes.
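
Hint: assuming the transformer meant here is scikit-learn's `StandardScaler`, the fit-then-inspect pattern looks like the sketch below. It uses random made-up data (`toy`, `x`, `y`, `target` are hypothetical names) and only 20 training rows instead of 1000.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(5.0, 2.0, size=50),
    "y": rng.normal(-1.0, 0.5, size=50),
    "target": rng.choice(["a", "b"], size=50),
})

# Fit on the first rows only, without the target column
train = toy.drop(columns="target").iloc[:20]
scaler = StandardScaler().fit(train)

print(scaler.mean_)    # per-column means learned from the training rows
print(scaler.scale_)   # per-column standard deviations (ddof=0)

# On the rows used for fitting, the result has mean 0 and std 1
scaled = scaler.transform(train)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Applying `scaler.transform` to rows that were not used for fitting reuses the same `mean_` and `scale_`, so those rows will generally not have exactly zero mean and unit variance.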

In [None]:
#[STUDENT]

## Outlier detection

Q13. Plot the distribution of each variable in a boxplot and analyze them. Do you see outliers?


In [None]:
#[STUDENT]

Q14. Identify, for each variable, the observations that fall outside the window mean +/- 3*std. How many rows would you remove in total?
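
Hint: the 3*std rule can be written as a boolean mask on z-scores. The sketch below applies it to random made-up data in which two extreme values have been planted (`toy`, `x`, `y` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Toy data for illustration only, with two planted extreme values.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=200), "y": rng.normal(size=200)})
toy.loc[0, "x"] = 25.0
toy.loc[1, "y"] = -30.0

# Distance to the column mean, measured in standard deviations
z = (toy - toy.mean()) / toy.std()
outside = z.abs() > 3                 # True where a value is outside mean +/- 3*std

print(outside.sum())                  # number of flagged values per variable
print(outside.any(axis=1).sum())      # number of rows flagged at least once
```

A row that is extreme on several variables is counted once in the row total, so that total can be smaller than the sum of the per-variable counts.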

In [None]:
#[STUDENT]

Q15. Let's now process all variables at once with a library: you can import and use the IsolationForest class from the sklearn.ensemble package.
Change the different parameter values to identify their impact.
Do you obtain results different from the analysis of single variables?
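
Hint: a minimal sketch of how `IsolationForest` is typically used, on random made-up data with five planted far-away points. The parameter values shown (`n_estimators=100`, `contamination=0.05`) are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data for illustration only: a Gaussian cloud plus five distant points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:5] = [[8, 8], [-8, 8], [8, -8], [-8, -8], [0, 10]]

# contamination is the assumed fraction of outliers in the data
forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = forest.fit_predict(X)        # 1 = inlier, -1 = outlier

print((labels == -1).sum())           # number of points flagged as outliers
print(np.where(labels == -1)[0])      # their row positions
```

Because `contamination` fixes the share of points to flag, the forest here also flags some ordinary points from the cloud; varying it is a quick way to see how sensitive the result is.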

In [None]:
#[STUDENT]

Q16. Display the variables 'votes_gop' and 'Unemployment_rate_2019' in a scatter plot and color the points according to whether they are outliers or not. Interpret.
You can try other pairs of variables.

In [None]:
#[STUDENT]

Q17. Display these scatter plots for all pairs of variables using pairplot from seaborn.

In [None]:
#[STUDENT]