# Michigan EcoData Python Pandas Workshop Fall 2022

> This is an introductory tutorial on the Python 3 programming language and some basic data science libraries
> Documentation:
- Python 3: https://docs.python.org/3/
- pandas: https://pandas.pydata.org/docs/index.html
- scikit-learn: https://scikit-learn.org/stable/
> 
> Download the data file from this link:
https://www.kaggle.com/datasets/802ea18195176358ddec8265a33ca8909b606123d687dc06123d1b1d2154d45c?resource=download

## First import necessary libraries

In [None]:
import pandas as pd
import numpy as np
import os

from numpy import array
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

## Hello World

In [None]:
print("Hello World")

In [None]:
x = "Hello World"
print(type(x))
print(x)

y = 3
z = 6
print(y + z)

> Here are some basic arithmetic operations in Python

In [None]:
print(z/y)
print(y%z)
print(y**z)
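
> As a quick sketch of what these operators return, using the same values `y = 3` and `z = 6`. The floor-division operator `//` is an addition that is not used in the cell above.

```python
y = 3
z = 6

quotient = z / y         # true division always returns a float
floor_quotient = z // y  # floor division of two ints stays an int
remainder = y % z        # remainder when 3 is divided by 6
power = y ** z           # 3 raised to the 6th power

print(quotient, floor_quotient, remainder, power)
```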

## Array and Loops

> Square brackets create a list. You can access an element of the list by its index, which starts at 0

In [None]:
arr = [1,2,3]
arr[0]

> In Python there are two common ways to iterate through a list with a for loop: directly over the elements, or over the indices when you need to modify the list in place

In [None]:
for i in arr:
    print(i)

In [None]:
for i in range(len(arr)):
    arr[i] += 1
print(arr)

> A lambda function is a small anonymous function in Python
>
> It can take any number of arguments but consists of a single expression
>
> You can apply a lambda function to every element of a list with `map`
>
> Here is a lambda function that squares a number only if it is even

In [None]:
squareIfEven = lambda x: x*x if x%2 == 0 else x
a = list(map(squareIfEven, arr))
a
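
> The example above uses a single argument. As a minimal sketch of a lambda with two arguments, `map` can take one list per argument. The names `add`, `left`, `right`, and `numbers` are made up for illustration. The second part shows the same "square if even" logic as a list comprehension, which is a common alternative to `map`.

```python
# A lambda with two arguments; map() takes one iterable per argument.
add = lambda a, b: a + b
left = [1, 2, 3]
right = [10, 20, 30]
sums = list(map(add, left, right))
print(sums)

# The "square if even" logic written as a list comprehension.
numbers = [2, 3, 4]
squared_if_even = [n * n if n % 2 == 0 else n for n in numbers]
print(squared_if_even)
```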

## Pandas

> What is pandas

- pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
- With pandas you can create DataFrames to organize and clean your data
- DataFrames are like Excel tables: they have rows and columns
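
> To make the table analogy concrete, here is a minimal sketch that builds a small DataFrame by hand. The values are made-up toy data, not taken from the workshop data set.

```python
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column.
toy = pd.DataFrame({
    "City": ["Ann Arbor", "Detroit", "Lansing"],
    "State": ["MI", "MI", "MI"],
    "pm25_median": [7.5, 9.1, 8.0],
})

print(toy.shape)             # (number of rows, number of columns)
print(list(toy.columns))     # column names
pm25_mean = toy["pm25_median"].mean()  # select one column, then average it
print(pm25_mean)
```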

In [None]:
df = pd.read_csv('pollution_data.csv')

> If you are using Jupyter Notebook, run this command to load the data
>
`df = pd.read_csv("pollution_data.csv")`

In [None]:
df.shape

In [None]:
df.columns

In [None]:
columns1 = ['Date', 'City','County','State']
columns2 = ['mil_miles','Population Staying at Home','Population Not Staying at Home']
columns3 = ['o3_median','pm25_median','no2_median','so2_median']
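# .copy() gives an independent DataFrame, so later in-place edits do not act on a slice of df
df_new = df[columns1 + columns2 + columns3].copy()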
df_new

## Data cleaning and dropna

> Real-world data often contains null (missing) values
>
> We can handle them by deleting the affected rows or by replacing the nulls with a predefined value

In [None]:
df_new.dropna()

In [None]:
df_new.dropna(subset=['so2_median'])

In [None]:
df_new.dropna(subset=['pm25_median','o3_median'], inplace = True)
df_new.head()
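
> The cells above delete rows. As a sketch of the other option, replacing nulls with a predefined value, `fillna` can be used. The toy data and the choice of fill values (0.0 or the column mean) are assumptions for illustration only; the right fill value depends on your analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values.
toy = pd.DataFrame({
    "o3_median": [30.0, np.nan, 28.0],
    "so2_median": [np.nan, 1.5, 2.5],
})

# Drop rows where o3_median is missing (returns a new DataFrame).
dropped = toy.dropna(subset=["o3_median"])

# Replace missing so2_median values with a fixed value.
filled = toy.fillna({"so2_median": 0.0})

# Replace every missing value with the mean of its column.
filled_mean = toy.fillna(toy.mean())

print(toy.isna().sum())  # the original DataFrame is unchanged
```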

In [None]:
df_new.dtypes

> As you can see, the population columns have the `object` data type
>
> pandas reads these values as strings because the numbers in the CSV file contain commas as thousands separators
>
> This is not what we want, since strings cannot be used in arithmetic or statistics until they are converted to numbers

In [None]:
df_new['population not at home new'] = df_new['Population Not Staying at Home'].apply(lambda x: int(x.replace(',', '')))

In [None]:
df_new['population at home new'] = df_new['Population Staying at Home'].apply(lambda x: int(x.replace(',', '')))

In [None]:
df_new.dtypes

In [None]:
df_new.head()
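
> The same conversion can also be sketched without `apply`. The example below uses a made-up two-row CSV (the column name `Population` is hypothetical) and shows two alternatives: the vectorized `.str.replace` followed by `astype(int)`, and the `thousands` parameter of `pd.read_csv`, which parses the separators while loading.

```python
import io
import pandas as pd

# Hypothetical CSV text with comma thousands separators.
csv_text = 'City,Population\nA,"1,234"\nB,"56,789"\n'

# Option 1: load as strings, then strip commas and cast to int.
raw = pd.read_csv(io.StringIO(csv_text))
raw["population_int"] = (
    raw["Population"].str.replace(",", "", regex=False).astype(int)
)

# Option 2: tell read_csv about the thousands separator up front.
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(raw["population_int"].tolist())
print(parsed["Population"].tolist())
```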

> After converting the population columns to int, we can delete the original population columns

In [None]:
df_new.drop(columns=['Population Staying at Home', 'Population Not Staying at Home'], inplace=True)

In [None]:
df_new.head()

In [None]:
df_new['Date']= pd.to_datetime(df_new['Date'])
df_new.dtypes

In [None]:
pd.DatetimeIndex(df_new['Date']).month
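
> Once a column holds datetime values, its parts can also be read through the `.dt` accessor, which gives the same months as `pd.DatetimeIndex(...).month`. A minimal sketch on made-up dates:

```python
import pandas as pd

# Hypothetical dates stored as strings, as they would be after reading a CSV.
dates = pd.DataFrame({"Date": ["2020-01-15", "2020-03-02", "2020-12-31"]})
dates["Date"] = pd.to_datetime(dates["Date"])

# The .dt accessor exposes the components of each datetime value.
dates["month"] = dates["Date"].dt.month
dates["year"] = dates["Date"].dt.year

print(dates)
```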

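## iloc and loc

> `iloc` selects rows and columns by integer position, while `loc` selects by label or by a boolean condition
>
> Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html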

In [None]:
df_new.iloc[:,5:9]

In [None]:
df_new.loc[lambda x: x.State == "MI"]

In [None]:
df_new[df_new['State'] == 'MI']

In [None]:
df_new.loc[lambda x: x.State == "MI"].iloc[:,5:9]
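
> The difference between position-based and label-based selection is easier to see on a small DataFrame whose index is not 0, 1, 2. The data and index labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with string labels as the index.
toy = pd.DataFrame(
    {"State": ["MI", "OH", "MI"], "pm25_median": [7.5, 9.0, 8.0]},
    index=["a", "b", "c"],
)

first_row = toy.iloc[0]     # by position: the first row
row_b = toy.loc["b"]        # by label: the row whose index is "b"
first_col = toy.iloc[:, 0]  # by position: all rows, first column

# loc also accepts a boolean mask for rows and a label for the column.
mi_pm25 = toy.loc[toy["State"] == "MI", "pm25_median"]
print(mi_pm25)
```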

## Group by and aggregate
> Documentation:
- Groupby: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
- Aggregate: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html

In [None]:

df_new.groupby(['City'])[['population not at home new','o3_median', 'pm25_median','no2_median','so2_median']].mean()


In [None]:
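# String names select pandas' built-in aggregations; the result columns are still named mean, std, and median
df_city = df_new.groupby(['City'])[['population not at home new','mil_miles','o3_median', 'pm25_median','no2_median','so2_median']].agg(['mean', 'std', 'median'])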
df_city.head()
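
> Aggregating with a list of functions produces two-level column names: the original column on top and the statistic below it. This is why later cells index the result twice. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data: two cities with two measurements each.
toy = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "pm25_median": [6.0, 8.0, 10.0, 14.0],
    "o3_median": [30.0, 32.0, 20.0, 24.0],
})

summary = toy.groupby("City")[["pm25_median", "o3_median"]].agg(["mean", "median"])

# Each column name is a (column, statistic) pair.
print(summary.columns.tolist())

# Two equivalent ways to pull out one statistic for one column.
pm25_mean = summary[("pm25_median", "mean")]
pm25_mean_chained = summary["pm25_median"]["mean"]
print(pm25_mean)
```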

## matplotlib
> matplotlib is a widely used package for plotting data
>
> It can create line graphs, scatter plots, histograms, and more

Documentation: https://matplotlib.org/stable/api/pyplot_summary.html

In [None]:
plt.scatter(df_city['population not at home new']['median'], df_city.pm25_median['median'])

In [None]:
df_city['pm25_median']['median'].hist()

In [None]:
df_MI = df_new.loc[lambda x: x.State == "MI"]
df_MI


In [None]:
df_MI.dtypes

In [None]:
plt.plot(df_MI.Date, df_MI.pm25_median)

## Basic linear regression and scikit-learn

In [None]:
plt.scatter(df_city['population not at home new']['median'], df_city.pm25_median['median'])

In [None]:
X = df_city['population not at home new'][['median']]
Y = df_city['pm25_median'][['median']]

In [None]:
regr = linear_model.LinearRegression()
regr.fit(X,Y)

In [None]:
Y_pred = regr.predict(X)
plt.scatter(X, Y)
plt.plot(X,Y_pred)


In [None]:
regr.score(X, Y)

> Multiple linear regression
>
> In this case, we have 2 predictors and 1 response

In [None]:
regr2=linear_model.LinearRegression()
idx = pd.IndexSlice
X2 = df_city.loc[:,idx[['population not at home new','mil_miles'],'median']]
X2.head()

In [None]:
regr2.fit(X2,Y)

In [None]:
Y_pred2 = regr2.predict(X2)

In [None]:
regr2.score(X2,Y)

In [None]:
plt.scatter(Y_pred2,Y)
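
> To see what `fit` and `score` report, it can help to run them on data where the answer is known. The sketch below uses synthetic, noise-free data generated from `y = 2*x1 + 3*x2 + 1` (an assumption for illustration, unrelated to the pollution data). For `LinearRegression`, `score` returns the coefficient of determination R², so a perfect fit gives a value of about 1.0.

```python
import numpy as np
from sklearn import linear_model

# Synthetic predictors: 5 observations, 2 predictors each.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
# Synthetic response with known coefficients (2 and 3) and intercept (1).
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

model = linear_model.LinearRegression()
model.fit(X, y)

r_squared = model.score(X, y)  # R^2 on the data the model was fitted to
print(model.coef_, model.intercept_, r_squared)
```

> With real data the fit is not exact, so R² falls below 1. Scoring on the same data used for fitting also tends to overstate how well the model will do on new data.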