# Exploratory Data Analysis

The objective of this notebook is to practice EDA.

Most of the work will be done with pandas; a tutorial is available [here](https://pandas.pydata.org/docs/user_guide/10min.html).


This notebook is heavily inspired by the notebooks published by Galiana, Lino. 2023. Python for Data Science. https://doi.org/10.5281/zenodo.8229676.


## Dataset Exploration
<div class="alert alert-block alert-warning">
The dataset under consideration contains information on elections and votes in the United States from 2000 to 2016, along with economic information. The following lines allow you to load the dataset.
</div>


In [1]:

!pip install geopandas
!pip install pandas

import pandas as pd
import geopandas as gpd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt






In [2]:
votes = pd.read_csv("data/votes.csv")
votes.FIPS = votes.FIPS.astype(str)
shp = gpd.read_file("data/votes_shp")

votes = shp.merge(votes,on="FIPS")

<div class="alert alert-block alert-info">
* Display the first three rows of the table, and the last three.
    <br>
* What is the size of the dataframe?
    <br>
* Display the names of the columns and their types.
    <br>
* In your opinion, what do the different columns represent?
</div>
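
As a warm-up before working on `votes`, here is a minimal sketch of the pandas calls that answer the first three questions. It runs on a small made-up table, so the column names and values (`county`, `population`, `rate`) are illustrative assumptions, not columns of the real dataset.

```python
import pandas as pd

# Made-up toy table; names and values are hypothetical, not taken from votes.csv.
toy = pd.DataFrame(
    {
        "county": ["A", "B", "C", "D", "E"],
        "population": [1200, 560, 98000, 4300, 150],
        "rate": [3.1, 4.7, 2.9, 5.5, 6.0],
    }
)

first_rows = toy.head(3)    # first three rows
last_rows = toy.tail(3)     # last three rows
n_rows, n_cols = toy.shape  # (number of rows, number of columns)
col_types = toy.dtypes      # one dtype per column, indexed by column name

print(first_rows, last_rows, sep="\n\n")
print(n_rows, n_cols)
print(col_types)
```

The same four attributes and methods should apply unchanged to `votes`.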

In [None]:
#[STUDENT]


<div class="alert alert-block alert-info">
* Display the statistics for each column (mean, variance, quartiles, min, max).
    <br>
* In your opinion, what does each column represent?
</div>
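
One possible way to get these statistics, sketched on a made-up numeric table (the columns `x` and `y` are hypothetical). Note that `describe()` reports the standard deviation rather than the variance, so the variance is computed separately here.

```python
import pandas as pd

# Made-up numeric table used only for illustration.
toy = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "y": [10.0, 10.0, 20.0, 40.0, 100.0],
    }
)

summary = toy.describe()  # count, mean, std, min, quartiles, max per numeric column
variances = toy.var()     # sample variance (ddof=1), which describe() does not report

print(summary)
print(variances)
```

On a table that mixes numeric and text columns, `describe(include="all")` also summarizes the non-numeric ones.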

In [None]:
#[STUDENT]


<div class="alert alert-block alert-info">
What are the different values of 'winner'? Assign an integer to each category of this column and save it in a new variable "winner2".
</div>
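
A minimal sketch of two common ways to turn category labels into integers, shown on a made-up column. The two labels are the ones used as keys in the plotting code later in this notebook; the rows themselves are invented, and `winner2_sorted` is a hypothetical extra column added only for comparison.

```python
import pandas as pd

# Made-up column; only the two label strings come from this notebook.
toy = pd.DataFrame(
    {"winner": ["republican", "democrats", "republican", "democrats", "republican"]}
)

labels = toy["winner"].unique()  # distinct values, in order of first appearance

# Option 1: integer codes in order of first appearance.
codes, uniques = pd.factorize(toy["winner"])
toy["winner2"] = codes

# Option 2: integer codes following the sorted (alphabetical) category order.
toy["winner2_sorted"] = toy["winner"].astype("category").cat.codes

print(labels)
print(toy)
```

The two options give different integers for the same label, so it is worth keeping track of which mapping was used.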

In [None]:
#[STUDENT]

## Descriptive Analysis
<div class="alert alert-block alert-info">
We will focus on a few columns from this dataset. Create a dataframe that includes only the variables: 'winner', 'votes_gop', 'Unemployment_rate_2019', 'Median_Household_Income_2019', 'Percent of adults with less than a high school diploma, 2015-19', "Percent of adults with a bachelor's degree or higher, 2015-19".

Use the "GEOID" column as the index of the dataframe (use set_index).
</div>
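
The pattern can be sketched on a made-up table as follows. The identifiers and counts are invented, and the column `other` is a hypothetical column that gets dropped by the selection.

```python
import pandas as pd

# Made-up table; all values are hypothetical.
toy = pd.DataFrame(
    {
        "GEOID": ["01001", "01003", "01005"],
        "winner": ["republican", "republican", "democrats"],
        "votes_gop": [100, 250, 80],
        "other": [1, 2, 3],
    }
)

keep = ["winner", "votes_gop"]

# Move GEOID into the index first, then select the columns of interest.
subset = toy.set_index("GEOID")[keep]

print(subset)
```

Setting the index before selecting means the identifier does not need to appear in the list of kept columns.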

In [None]:
#[STUDENT]


<div class="alert alert-block alert-info">
    Create a dataframe that, for each value of "winner2", gives the total number of occurrences. Display the histogram.
</div>

In [None]:
#[STUDENT]


<div class="alert alert-block alert-info">
We consider the variable 'Median_Household_Income_2019'. We want to transform it into a categorical variable with 5 labels. How can we do this and what should we pay attention to?
    <br>
Create a table of the number of occurrences and display the histogram.
</div>
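
Two standard pandas options are `pd.cut` (bins of equal width) and `pd.qcut` (bins with roughly equal numbers of observations). The sketch below compares them on a made-up, right-skewed series; the values and the five label names are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up incomes (in thousands) with one very large value.
income = pd.Series([20, 22, 25, 27, 30, 32, 35, 40, 60, 200], name="income")
labels = ["very low", "low", "medium", "high", "very high"]

# Equal-width bins: the range [min, max] is split into 5 intervals of equal length.
equal_width = pd.cut(income, bins=5, labels=labels)

# Equal-frequency bins: the cut points are the 20%, 40%, 60% and 80% quantiles.
equal_freq = pd.qcut(income, q=5, labels=labels)

width_counts = equal_width.value_counts(sort=False)
freq_counts = equal_freq.value_counts(sort=False)

print(width_counts)
print(freq_counts)
```

With a skewed variable, equal-width bins can leave most observations in a single class and some classes empty, which is one of the points to watch for when choosing the binning.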

In [None]:
#[STUDENT]


In [None]:
#[STUDENT]


<div class="alert alert-block alert-info">
Display the statistics of all columns in the dataframe.
</div>

In [None]:
#[STUDENT]


<div class="alert alert-block alert-info">
Create a histogram of the variable 'Unemployment_rate_2019' with 20 bins.
</div>
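
For reference, a histogram with a fixed number of bins can be sketched with matplotlib as below. The data are synthetic (drawn from a gamma distribution purely to get a skewed shape) and the axis label is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic right-skewed values; not the real unemployment rates.
rng = np.random.default_rng(0)
rates = rng.gamma(shape=4.0, scale=1.0, size=500)

fig, ax = plt.subplots()
# hist returns the count per bin and the bin edges (one more edge than bins).
counts, edges, _ = ax.hist(rates, bins=20)
ax.set_xlabel("value")
ax.set_ylabel("number of observations")
```

On a dataframe column, `df["column"].plot.hist(bins=20)` is an equivalent shortcut.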

In [None]:
#[STUDENT]


<div class="alert alert-block alert-info">
Compute the correlation matrix and display it as a heatmap. Then display the scatter plot matrix (pair plot) of the dataframe. What relationship do you see between these visualizations? What can you deduce?
</div>

In [None]:
#[STUDENT]


In [None]:
#[STUDENT]


## Visualizing Geographic Data
<div class="alert alert-block alert-warning">
The two following code blocks generate two different types of graphs (the first one is called a "choropleth"). Same dataset, but different representations. Discuss and comment.
</div>

In [None]:
# Republican: red, Democrat: blue
color_dict = {'republican': '#FF0000', 'democrats': '#0000FF'}

fig, ax = plt.subplots(figsize = (12,12))
grouped = votes.groupby('winner')
for key, group in grouped:
    group.plot(ax=ax, label=key, color=color_dict[key])
plt.axis('off')

In [None]:
import plotly
import plotly.graph_objects as go

# Replace each county polygon with its centroid so it can be drawn as a point
centroids = votes.copy()
centroids.geometry = centroids.centroid

color_dict = {"republican": '#FF0000', 'democrats': '#0000FF'}
centroids["winner"] =  np.where(centroids['votes_gop'] > centroids['votes_dem'], 'republican', 'democrats')
centroids['lon'] = centroids['geometry'].x
centroids['lat'] = centroids['geometry'].y
# Keep only the columns needed for the plot, as a plain pandas DataFrame
centroids = pd.DataFrame(centroids[["county_name",'lon','lat','winner', 'CENSUS_2010_POP',"state_name"]])

df = centroids.copy()

df['color'] = df['winner'].replace(color_dict)
df['size'] = df['CENSUS_2010_POP']/6000
df['text'] = df['CENSUS_2010_POP'].astype(int).apply(lambda x: '<br>Population: {:,} people'.format(x))
df['hover'] = df['county_name'].astype(str) +  df['state_name'].apply(lambda x: ' ({}) '.format(x)) + df['text']

fig_plotly = go.Figure(
  data=go.Scattergeo(
  locationmode = 'USA-states',
  lon=df["lon"], lat=df["lat"],
  text = df["hover"],
  mode = 'markers',
  marker_color = df["color"],
  marker_size = df['size'],
  hoverinfo="text"
  )
)

fig_plotly.update_traces(
  marker = {'opacity': 0.5, 'line_color': 'rgb(40,40,40)', 'line_width': 0.5, 'sizemode': 'area'}
)

fig_plotly.update_layout(
  title_text = "Reproduction of the \"Acres don't vote, people do\" map <br>(Click legend to toggle traces)",
  showlegend = True,
  geo = {"scope": 'usa', "landcolor": 'rgb(217, 217, 217)'}
)

<div class="alert alert-block alert-info">
    Use StandardScaler and MinMaxScaler from sklearn to normalize the variables in the dataframe. Compare the histograms obtained, for example, for the variable "Median_Household_Income_2019".
</div>
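
A minimal sketch of both scalers on a made-up two-column table (the names `income` and `rate` and their values are assumptions). `StandardScaler` centers each column to mean 0 and scales it to unit standard deviation; `MinMaxScaler` maps each column to the interval [0, 1].

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up table with two columns on very different scales.
toy = pd.DataFrame(
    {
        "income": [30_000.0, 45_000.0, 52_000.0, 61_000.0, 120_000.0],
        "rate": [3.0, 4.5, 5.0, 6.5, 9.0],
    }
)

# fit_transform returns a NumPy array, so it is wrapped back into a dataframe.
standardized = pd.DataFrame(
    StandardScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)
min_maxed = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

print(standardized.round(2))
print(min_maxed.round(2))
```

Both transformations are linear within each column, which is worth keeping in mind when comparing the histograms before and after scaling.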

In [None]:
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler,MinMaxScaler
#[STUDENT]



In [None]:
#[STUDENT]


In [None]:
#[STUDENT]


In [None]:
#[STUDENT]


## Anomaly Detection

<div class="alert alert-block alert-info">
Display the distribution of each variable with a boxplot and a violin plot. Do you observe any anomalies? What is the purpose of these two visualizations?
</div>

In [None]:
#[STUDENT]


<div class="alert alert-block alert-info">
Identify, for each variable, the individuals that fall outside the interval mean ± 3 × std (more than three standard deviations from the mean). How many individuals does this concern? Create a new column "outlier" in the dataframe, assigning 1 when an individual is an outlier for at least one variable, and 0 otherwise.
</div>
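
The rule can be sketched as follows on synthetic data. The columns `a` and `b` are hypothetical, and two extreme values are planted on purpose so that the rule has something to flag.

```python
import numpy as np
import pandas as pd

# Synthetic data with two planted extreme values.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy.loc[0, "a"] = 15.0
toy.loc[1, "b"] = -12.0

# Standardize each column, then test |z| > 3 column by column.
z = (toy - toy.mean()) / toy.std()
outside = z.abs() > 3

per_variable = outside.sum()  # number of flagged rows for each variable

# A row is an outlier if it is flagged for at least one variable.
toy["outlier"] = outside.any(axis=1).astype(int)

print(per_variable)
print(toy["outlier"].sum())
```

Extreme values inflate the mean and the standard deviation themselves, so this rule can miss outliers when many are present.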

In [None]:
#[STUDENT]


<div class="alert alert-block alert-info">
Display a scatter plot of the variables 'votes_gop' and 'Unemployment_rate_2019', coloring the points according to whether they are outliers or not.
</div>
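
One way to color points by a 0/1 flag is to draw each group separately, as in this sketch on made-up values (the columns `x` and `y`, the colors, and the legend labels are assumptions).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up points with a precomputed 0/1 outlier flag.
toy = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 30.0, 4.0, 5.0],
        "y": [2.0, 2.5, 3.1, 3.0, 3.9, 40.0],
        "outlier": [0, 0, 0, 1, 0, 1],
    }
)

fig, ax = plt.subplots()
# groupby iterates over the flag values in sorted order: 0 first, then 1.
for flag, group in toy.groupby("outlier"):
    ax.scatter(
        group["x"],
        group["y"],
        label="outlier" if flag == 1 else "regular",
        color="tab:red" if flag == 1 else "tab:blue",
    )
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```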

In [None]:
#[STUDENT]


<div class="alert alert-block alert-info">
Display a scatter plot matrix of all the variables (one scatter plot per pair of variables).
</div>

In [None]:
#[STUDENT]
