# Class 18: Maps and intro to statistical inference

Plan for today:
- Continuation of maps using geopandas
- Intro to statistical inference


## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(18)   # get class code    
# YData.download.download_class_code(18, True) # get the code with the answers 

YData.download.download_data("States_shapefile.geojson")
YData.download.download_data("state_demographics.csv")
YData.download.download_data("ne_110m_graticules_10.prj")
YData.download.download_data("ne_110m_graticules_10.shp")
YData.download.download_data("ne_110m_graticules_10.shx")
YData.download.download_data("ne_110m_graticules_10.dbf")

# YData.download.download_data("dennys.csv")

There are also similar functions to download the homework:

In [None]:
# YData.download.download_homework(7)  # downloads the homework 

If you are using Google Colab, you should install polars and the YData package by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px


#from datetime import datetime
#import statistics


import matplotlib.pyplot as plt
%matplotlib inline

## 1. Quick review: interactive heatmap with plotly

As we discussed last class, we can create interactive visualizations using the [plotly express package](https://plotly.com/python/plotly-express/), which are useful for exploring data to understand key trends. 

Let's briefly review creating heatmaps using plotly and explore using the `.pivot()` method (as opposed to the `.pivot_table()` method we discussed last class). 



In [None]:
gapminder = px.data.gapminder()   # the plotly package comes with the gapminder data

print(type(gapminder))

gapminder.head(3)

#### Review: Pivot tables and heatmaps

Heatmaps allow us to view data that is a function of two variables. 

In order to create a heatmap, we first need to transform our data into a DataFrame that has appropriate rows and columns. One way we can do this is to use the pandas `.pivot_table(index = , columns = , values = , aggfunc = )` method, where the arguments to this method are:

- `index`: The variable we want in the rows of our DataFrame
- `columns`: The variable we want in the columns of our DataFrame
- `values`: The values we want to be in the DataFrame
- `aggfunc`: The function we will use to aggregate our data 

Let's apply the `.pivot_table()` method to our gapminder data to create a DataFrame called `gapminder_continent_wide` where:

- The rows are the different continents
- The columns are the years
- The values in the DataFrame are the average life expectancy (for each continent in each year)


In [None]:
# use the .pivot_table() method to aggregate data into a pivot table
gapminder_continent_wide = gapminder.pivot_table(index = 'continent', 
                                                 columns = 'year', 
                                                 values = 'lifeExp', 
                                                 aggfunc = 'mean')
gapminder_continent_wide.head()

Now that we have the appropriate DataFrame, let's use the plotly `imshow()` function to visualize it!

In [None]:
# use imshow() to visualize the data
fig = px.imshow(gapminder_continent_wide)

fig.update_layout(xaxis_title = "Year", yaxis_title = "")

#### The .pivot() method

If we want to create a DataFrame similar to `gapminder_continent_wide` (in terms of having rows and columns be particular variables) but don't want to aggregate our data, we can use the `.pivot(index = , columns = , values = )` method, where the arguments to this method are:

- `index`: The variable we want in the rows of our DataFrame
- `columns`: The variable we want in the columns of our DataFrame
- `values`: The values we want to be in the DataFrame

i.e., this method is the same as the `.pivot_table()` method, but it does not take an `aggfunc` argument. 
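
As a minimal sketch of the difference, the toy example below uses made-up values (not the gapminder data) and hypothetical names `long_df`, `by_country`, and `by_continent`. It reshapes the same long-format table with both methods.

```python
import pandas as pd

# Toy long-format data (hypothetical values for illustration only)
long_df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "continent": ["X", "X", "X", "X"],
    "year": [2002, 2007, 2002, 2007],
    "lifeExp": [70.0, 72.0, 60.0, 64.0],
})

# .pivot() only rearranges: each (index, columns) pair must appear exactly once
by_country = long_df.pivot(index="country", columns="year", values="lifeExp")

# .pivot_table() aggregates: countries A and B are averaged within continent X
by_continent = long_df.pivot_table(index="continent", columns="year",
                                   values="lifeExp", aggfunc="mean")

print(by_country)
print(by_continent)
```

Calling `.pivot()` with `index="continent"` on this table would raise a `ValueError`, because continent X has more than one row per year and `.pivot()` has no rule for combining them.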

Let's create a DataFrame called `gapminder_continent_wide2` with the following properties: 

- The rows are data for each country
- The columns are data from each year.

Once you have created `gapminder_continent_wide2`, visualize it using the plotly `px.imshow()` function. 


In [None]:
# Use the .pivot() method to rearrange data into a pivot table




In [None]:
# Let's visualize the data as a heatmap



## 2. Spatial mapping with geopandas

Visualizing spatial data through maps is another powerful way to see trends in data. There are several mapping packages in Python. Here we will use the geopandas package to create maps. 

The geopandas package defines a `GeoDataFrame`, which is a pandas DataFrame with an additional column (called `geometry` by default) that specifies geographic information. 

Let's explore this now!
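
To make the idea of a `geometry` column concrete, here is a minimal sketch that builds a tiny GeoDataFrame by hand. The two square "regions", their names, and the population numbers are all made up for illustration.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

# Two made-up regions described by polygons in longitude/latitude degrees
regions = gpd.GeoDataFrame(
    {"name": ["west", "east"], "population": [100, 250]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (3, 0), (3, 1), (1, 1)]),
    ],
    crs="EPSG:4326",
)

print(isinstance(regions, pd.DataFrame))  # a GeoDataFrame is also a pandas DataFrame
print(regions.geometry.name)              # name of the active geometry column
print(regions.geometry.bounds)            # bounding box (minx, miny, maxx, maxy) per shape
```

Ordinary pandas operations such as `.query()` and `.head()` work on `regions`, while the `geometry` column is what `.plot()` draws.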


### Visualizing boundaries

Let's start by looking at some geopandas DataFrames and visualizing some geometric boundaries.

Below we load the gapminder data again and get the gapminder data from 2007. We also show which maps come with geopandas. 


In [None]:
import geopandas as gpd
import plotly.express as px

gapminder_2007 = px.data.gapminder().query("year == 2007")   # the plotly package comes with the gapminder data


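# see which maps come with geopandas
# (note: gpd.datasets was deprecated in geopandas 0.14 and removed in 1.0, so this cell needs an older version)
gpd.datasets.available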


Let's get a geopandas DataFrame that has the countries in the world...

In [None]:
# View the world geopandas DataFrame

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

print(type(world))

world.head()

In [None]:
# changing map properties



In [None]:
# plot just the united states



### Coordinate reference systems and projections

A coordinate reference system (CRS) is a framework used to precisely measure locations on the surface of the Earth as coordinates. The goal of any spatial reference system is to create a common reference frame in which locations can be measured precisely and consistently as coordinates, which can then be shared unambiguously, so that any recipient can identify the same location that was originally intended by the originator.

There are two different types of coordinate reference systems: Geographic Coordinate Systems and Projected Coordinate Systems. [Projected coordinate systems](https://en.wikipedia.org/wiki/List_of_map_projections) map 3D coordinates onto a 2D plane so they can be plotted. Different projected coordinate systems preserve different properties, such as keeping all angles intact, which is useful for navigation (e.g., the Mercator projection), or keeping the relative sizes of land areas intact (e.g., the Eckert IV projection). 

A detailed discussion of CRS is beyond the scope of this class. For our purposes, what matters is that all layers in a map use the same projection (otherwise, for example, data points representing cities and the underlying spatial map won't line up). 
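
To see what "mapping onto a 2D plane" means in practice, here is a rough sketch of the standard spherical Mercator formulas written directly in numpy. This is a textbook form rather than anything taken from the class code, and the function names `mercator` and `scale_factor` are hypothetical.

```python
import numpy as np

R = 6378137.0  # sphere radius in metres, the value used by Web Mercator

def mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to planar x, y in metres."""
    lam = np.radians(lon_deg)
    phi = np.radians(lat_deg)
    x = R * lam
    y = R * np.log(np.tan(np.pi / 4 + phi / 2))
    return x, y

def scale_factor(lat_deg):
    """Factor by which lengths are stretched at a given latitude."""
    return 1 / np.cos(np.radians(lat_deg))

print(mercator(180, 0))   # the right-hand edge of the map, on the equator
print(scale_factor(60))   # stretching at 60 degrees north
```

Lengths at 60 degrees latitude are stretched by a factor of about 2, so areas there appear roughly 4 times larger than at the equator. That is the price Mercator pays for preserving angles, and it is why equal-area projections exist as an alternative.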

Let's very briefly explore different map projections... 


In [None]:
# Read Graticules (lines on a map)
graticules = gpd.read_file("ne_110m_graticules_10.shp")
print(graticules.crs)
graticules.head(3)

In [None]:
# WGS84 longitude/latitude (EPSG:4326) is the default CRS here; the angle-preserving Web Mercator projection is EPSG:3857

# print the default CRS



# plot the map




In [None]:
# Eckert IV is an equal-area projection  ("ESRI:54012")





In [None]:
# Robinson projection - neither equal-area nor conformal ("ESRI:54030") 





To learn more about "What your favorite map projection says about you" see: https://xkcd.com/977/

### Maps with layers and markers

We can also plot points on a map. When doing so, it's important that the points and the underlying map use the same coordinate reference system (CRS).

Let's add Denny's locations to the map of the United States!


In [None]:
# Let's start by getting a map of just the United States

state_map = world.query("name == 'United States of America'")

state_map

In [None]:
# visualize just the United States



In [None]:
# Get the coordinate reference system (CRS) for our map



Let's now load our Denny's data!

In [None]:
# Let's load our Denny's data
dennys = pd.read_csv("dennys.csv")
dennys.head()

To plot these locations, we need to convert the longitude and latitude coordinates into geometric objects (i.e., Shapely objects). We can do this using the `gpd.points_from_xy(long, lat)` function. 
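
As a small sketch of this conversion, the example below uses two made-up rows instead of the Denny's file. The column names `longitude` and `latitude` are assumptions for illustration and may differ from those in `dennys.csv`.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical locations; approximate coordinates for illustration only
places = pd.DataFrame({
    "city": ["New Haven", "Boston"],
    "longitude": [-72.93, -71.06],
    "latitude": [41.31, 42.36],
})

# Note the argument order: x is longitude and y is latitude
points = gpd.points_from_xy(places["longitude"], places["latitude"])

# Attach the points as the geometry column of a GeoDataFrame
places_gdf = gpd.GeoDataFrame(places, geometry=points)

print(places_gdf.head())
print(places_gdf.crs)  # no CRS has been declared yet
```

Passing latitude first is a common mistake that silently produces a transposed map, so it is worth checking a point or two after conversion.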

In [None]:
# Let's convert our longitude and latitude coordinates into geometric (Shapely) objects 




In [None]:
# Let's now convert our data into a geopandas DataFrame




In [None]:
# We can plot the location of the Denny's using the plot function



In [None]:
# Let's check the CRS



Before plotting data, we should set the appropriate coordinate reference system (CRS). This is particularly important when we are combining different layers on a map, such as putting city locations on a map that has the outlines of regional borders. 

The most commonly used CRS based on longitude and latitude coordinates is the [World Geodetic System 1984 (WGS84)](https://epsg.io/4326). It is often referred to by its code in the EPSG Geodetic Parameter Dataset, which is `4326`. 

Thus, we should set the coordinate system to be EPSG 4326. We can do this using the method `.set_crs(4326)`. Let's set this on our `dennys_gpd` DataFrame. 
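
One point that is easy to miss: `.set_crs()` only declares which CRS the existing coordinates are in, while `.to_crs()` actually transforms the coordinates into a different CRS. The sketch below illustrates the distinction on two made-up points; the names `pts` and `pts_merc` are hypothetical.

```python
import geopandas as gpd

# Two made-up points given as longitude/latitude, with no CRS attached yet
pts = gpd.GeoDataFrame(
    {"label": ["a", "b"]},
    geometry=gpd.points_from_xy([-72.93, -71.06], [41.31, 42.36]),
)
print(pts.crs)  # None

# Declare that the numbers are WGS84 longitude/latitude (values are unchanged)
pts = pts.set_crs(4326)
print(pts.crs)
print(pts.geometry.iloc[0])

# Transform the coordinates to Web Mercator (values change to metres)
pts_merc = pts.to_crs("EPSG:3857")
print(pts_merc.geometry.iloc[0])
```

When layering points over a map, a reasonable pattern is to declare the CRS of each layer with `.set_crs()` and then convert one layer with `.to_crs(other_layer.crs)` if the two differ.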


In [None]:
# Let's set the CRS to match the CRS of our map (which is EPSG 4326)






Now that we have our Denny's location in the same coordinate system as our map, we can add the points to the map. 

In [None]:
#state_map = gpd.read_file("States_shapefile.geojson")






### Choropleth maps

In choropleth maps, predefined regions are filled in with colors based on values of interest. 

Typically to create a choropleth map we join data of interest onto a map. 
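
A minimal sketch of that join step is shown below, using two made-up square regions rather than real borders. The column names (`State_Name`, `State`, `value`) are assumptions for illustration, and the key column is upper-cased first so that the names match across the two tables.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

# A made-up "map" with two square regions
region_map = gpd.GeoDataFrame(
    {"State_Name": ["ALPHA", "BETA"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Made-up data of interest, keyed by region name with different capitalization
values = pd.DataFrame({"State": ["Alpha", "Beta"], "value": [3.5, 7.0]})

# Make the join keys comparable before merging
values["State_upper"] = values["State"].str.upper()

# Merging with the GeoDataFrame on the left keeps the geometry column
joined = region_map.merge(values, left_on="State_Name",
                          right_on="State_upper", how="left")
print(type(joined))
print(joined[["State_Name", "value"]])

# joined.plot(column="value", legend=True) would then color each region by its value
```

Using `how="left"` keeps every region on the map even when no data matches, which makes missing joins show up as `NaN` values rather than as regions silently disappearing.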

Let's explore this now...


In [None]:
# join the gapminder data onto our world map





In [None]:
# plot a choropleth map of life expectancy






In [None]:
# change the color scale




In [None]:
# Can plot quantiles



### Another choropleth map example

Let's create a choropleth map to examine which states in the USA are growing, in the sense of having many young children. 

Any thoughts on which state this might be? 

To start, let's load a map with the outlines of the states in the USA, and load demographic data.

In [None]:
state_map = gpd.read_file("States_shapefile.geojson")

print(state_map.crs)

state_map.head(3)

In [None]:
# load demographic data on the states

state_demographics = pd.read_csv("state_demographics.csv")
state_demographics.head(3)

In [None]:
# In order to join the DataFrames, we need to make sure the states have the same capitalization





In [None]:
# join the demographic information onto the USA map





In [None]:
# Let's plot the map 



Is there anything [wrong with this map](https://xkcd.com/1138/)? 

In [None]:
# Let's look at the proportion of people under the age of 5





In [None]:
# Let's plot the new map



## 3. Statistical inference

In statistical inference we use a smaller sample of data to make claims about a larger population of data. 

As an example, let's look at the [2020 election](https://www.cookpolitical.com/2020-national-popular-vote-tracker) between Donald Trump and Joe Biden, and let's focus on the results from the state of Georgia. After all the votes had been counted, the results showed that:

- Biden received 2,473,633 votes
- Trump received 2,461,854 votes

Since we have all the votes in the election data, we can precisely calculate the population parameter of the proportion of votes that Biden received, which we will denote with the symbol $\pi_{Biden}$. 

Let's create names `num_trump_votes` and `num_biden_votes`, and calculate `true_prop_Biden` which is the value $\pi_{Biden}$. 
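
As a quick sketch of the arithmetic, the parameter is Biden's votes divided by the two-candidate total. The counts below are the ones assigned in this notebook's code, and third-party votes are ignored.

```python
# Vote counts for Georgia as used in this notebook (third parties excluded)
num_biden_votes = 2_473_633
num_trump_votes = 2_461_854

# Population parameter: proportion of the two-candidate vote that went to Biden
true_prop_Biden = num_biden_votes / (num_biden_votes + num_trump_votes)

print(round(true_prop_Biden, 4))
print(num_biden_votes - num_trump_votes)  # margin in votes
```

The proportion is only slightly above one half, which is what makes this a useful example: a poll of modest size can easily land on either side of 50%.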

In [None]:
num_trump_votes = 2461854  # 2,461,854
num_biden_votes = 2473633  # 2,473,633

# get True proportion of Biden's vote (excluding 3rd parties)



The code below creates a DataFrame called `georgia_df` that captures these election results. Each row in the DataFrame represents a vote. The column `Voted Biden` is `True` if a voter voted for Biden and `False` if the voter voted for Trump. 

In [None]:
# Create a DataFrame with the Georgia election data

biden_votes = np.repeat(True, num_biden_votes)     # create 2,473,633 Trues for the Biden votes
trump_votes = np.repeat(False, num_trump_votes)    # create 2,461,854 Falses for the Trump votes
election_outcome = np.concatenate((biden_votes, trump_votes))  # put the votes together

georgia_df = pd.DataFrame({"Voted Biden": election_outcome})  # create a DataFrame with the data
georgia_df = georgia_df.sample(frac = 1)   # shuffle the order to make it more realistic

georgia_df.head()

Now suppose we didn't know the actual value of $\pi_{Biden}$ and we wanted to estimate it based on a poll of 1,000 voters. We can simulate this by using the pandas `.sample(n = )` method.

Let's simulate sampling random voters
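
Here is one way a single simulated poll might look. It is a self-contained sketch that rebuilds the vote table under the hypothetical name `votes`, and the fixed `random_state` is only there to make the example reproducible.

```python
import numpy as np
import pandas as pd

# Rebuild the population of votes (True = voted Biden, False = voted Trump)
num_biden_votes = 2_473_633
num_trump_votes = 2_461_854
votes = pd.DataFrame({
    "Voted Biden": np.concatenate((np.repeat(True, num_biden_votes),
                                   np.repeat(False, num_trump_votes)))
})

# Simulate one poll of 1,000 voters drawn without replacement
poll = votes.sample(n=1000, random_state=123)

# The sample statistic p-hat is the proportion of polled voters who chose Biden
p_hat = poll["Voted Biden"].mean()
print(p_hat)
```

Because `True` counts as 1 and `False` as 0, taking the mean of a Boolean column gives the proportion of `True` values directly.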

In [None]:
# sample 10 random points



In [None]:
# simulate proportions of voters that voted for Biden - i.e., p-hat




### Sampling distribution

Suppose 100 polls were conducted. How many of them would show that Biden would get the majority of the vote? 

Let's simulate this "sampling distribution" of statistics now... 
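
A rough sketch of such a simulation is below. For speed it draws each poll with numpy's `Generator.choice` on a Boolean array instead of the pandas `.sample()` method, which is a substitution made for this example. The seed and the name `population` are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the example is reproducible

# Population of votes (True = voted Biden, False = voted Trump)
num_biden_votes = 2_473_633
num_trump_votes = 2_461_854
population = np.concatenate((np.repeat(True, num_biden_votes),
                             np.repeat(False, num_trump_votes)))

sample_size = 1000
num_simulations = 100

sampling_dist = []
for _ in range(num_simulations):
    # One simulated poll: sample voters without replacement
    poll = rng.choice(population, size=sample_size, replace=False)
    # Record this poll's p-hat
    sampling_dist.append(poll.mean())

sampling_dist = np.array(sampling_dist)

print((sampling_dist > 0.5).mean())  # fraction of polls showing a Biden majority
print(sampling_dist.std())           # spread of p-hat across polls
```

The spread of the simulated statistics should be close to $\sqrt{0.5 \times 0.5 / 1000} \approx 0.016$, which is far larger than Biden's actual margin above 50%. That is why many individual polls of this size would point to the wrong winner.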


In [None]:
sample_size = 1000
num_simulations = 100

sampling_dist = []








In [None]:
# plot the sampling distribution 


