#### About: 

In this notebook we learn exploratory data analysis (EDA) using pandas, with air pollution and temperature data for the city of Chicago.

Exploratory data analysis (EDA), pioneered by John Tukey, set a foundation for the field of data science. The key idea of EDA is that the first and most important step in any project based on data is to look at the data. By summarizing and visualizing the data, you can gain valuable intuition and understanding of the project.

The goal of exploratory data analysis is to get you thinking about your data and reasoning about your question, i.e.:

- to make sure we have the right data,
- to identify any problems with the dataset,
- to determine whether the data can answer our question, and to get a rough idea of what the answer will look like.

Exploratory data analysis is the search for patterns and trends in a given data set.
Looking at the bigger picture of data-driven science, we start by collecting a data set of reasonable size and then look for patterns that ideally will play the role of hypotheses for future analysis.

After the basic exploratory analysis we can pause and consider whether our question needs refinement or whether we need to collect more or new data.

Exploratory data analysis is about looking carefully at your dataset: identifying any errors in data collection and processing, finding violations of statistical assumptions, and suggesting interesting hypotheses[2].

As a first step we could answer the following questions: Who constructed this dataset, when, and why? What is the size of the data, and what do the various columns describe?


##### Topics Covered:

- Pandas functions for computing descriptive statistics
- Extracting rows based on conditionals 
- Sorting based on row values
- Computing relative differences
- Renaming columns, the groupby function, etc.


##### Acknowledgement 
Topics and code are inspired by Chapter 4 of the book Exploratory Data Analysis with R
and by http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf


In [None]:
pip install pyreadr

Collecting pyreadr
  Downloading pyreadr-0.4.0-cp37-cp37m-manylinux2014_x86_64.whl (410 kB)
     |████████████████████████████████| 410 kB 10.7 MB/s 
Installing collected packages: pyreadr
Successfully installed pyreadr-0.4.0
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [None]:
import pyreadr
import pandas as pd

In [None]:
result = pyreadr.read_r('data/chicago')

In [None]:
df = result[None]

##### Understanding the sample with basic statistics: shape, describe(), info() and head()



> The data that come from making a particular measurement on all of the subjects in a sample represent our observations for a single characteristic such as age, gender, speed at a task, or response to a stimulus. We should think of these measurements as representing a “sample distribution” of the variable, which in turn more or less represents the “population distribution” of the variable. The usual goal of univariate non-graphical EDA is to better appreciate the “sample distribution” and also to make some tentative conclusions about what population distribution(s) is/are compatible with the sample distribution. Outlier detection is also a part of this analysis.


In [None]:
df.shape


(6940, 8)


### Central tendency

The central tendency or “location” of a distribution has to do with typical or middle values. The common, useful measures of central tendency are the statistics called (arithmetic) mean, median, and sometimes mode. 



#### Mean: 
![](https://paper-attachments.dropbox.com/s_478598AFA2F5777FB9289D2A6B80C2413B0ADE29B5C1C858EBCBF98AFD0D611D_1611725655490_Screen+Shot+2021-01-27+at+4.33.58+pm.png)


For any symmetrically shaped distribution (i.e., one with a symmetric histogram or pdf or pmf) the mean is the point around which the symmetry holds. For non-symmetric distributions, the mean is the “balance point”: if the histogram is cut out of some homogeneous stiff material such as cardboard, it will balance on a fulcrum placed at the mean.
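
The "balance point" description can be checked numerically: the signed deviations from the mean always cancel out. The sketch below uses a small hypothetical sample (the values are made up for illustration and are not taken from the Chicago data).

```python
import pandas as pd

# Hypothetical toy sample, chosen only for illustration
values = pd.Series([2.0, 4.0, 4.0, 10.0])

sample_mean = values.mean()        # (2 + 4 + 4 + 10) / 4
deviations = values - sample_mean  # signed distance of each value from the mean

print(sample_mean)
print(deviations.sum())            # deviations cancel out at the balance point
```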


#### Median: 

The median is another measure of central tendency. The sample median is the middle value after all of the values are put in an ordered list. If there are an even number of values, take the average of the two middle values. (If there are ties at the middle, some special adjustments are made by the statistical software we will use. In unusual situations for discrete random variables, there may not be a unique median.)
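
As a small illustration of the odd and even cases, the following sketch computes the median of two hypothetical samples with pandas. The values are invented for the example.

```python
import pandas as pd

# Hypothetical samples; pandas sorts the values internally
odd_sample = pd.Series([7, 1, 3])      # sorted: 1, 3, 7
even_sample = pd.Series([7, 1, 3, 5])  # sorted: 1, 3, 5, 7

odd_median = odd_sample.median()       # the single middle value
even_median = even_sample.median()     # average of the two middle values

print(odd_median, even_median)
```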

#### Mean vs Median: 

For unimodal skewed (asymmetric) distributions, the mean is farther in the direction of the “pulled out tail” of the distribution than the median is. Therefore, for many cases of skewed distributions, the median is preferred as a measure of central tendency. For example, according to the US Census Bureau 2004 Economic Survey, the median income of US families, which represents the income above and below which half of families fall, was $43,318. This seems a better measure of central tendency than the mean of $60,828, which indicates how much each family would have if we all shared equally.
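
The same effect shows up in a tiny right-skewed sample. The incomes below are hypothetical numbers picked to mimic a long upper tail; they are not the Census figures quoted above.

```python
import pandas as pd

# Hypothetical right-skewed incomes: one very large value forms the tail
incomes = pd.Series([30_000, 40_000, 45_000, 50_000, 250_000])

mean_income = incomes.mean()      # pulled toward the long upper tail
median_income = incomes.median()  # stays at the middle household

print(mean_income, median_income)
```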

#### Robustness: 

The median has a very special property called robustness. A sample statistic is “robust” if moving some data tends not to change the value of the statistic. The median is highly robust, because you can move nearly all of the upper half and/or lower half of the data values any distance away from the median without changing the median. More practically, a few very high values or very low values usually have no effect on the median.
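
A quick sketch of robustness, using made-up numbers: pushing the two largest values far away changes the mean a great deal but leaves the median where it was.

```python
import pandas as pd

# Hypothetical sample and a copy whose upper values were moved far out
original = pd.Series([1, 2, 3, 4, 5])
shifted = pd.Series([1, 2, 3, 400, 5000])

# The median only depends on the middle of the ordered list
print(original.median(), shifted.median())
# The mean reacts strongly to the moved values
print(original.mean(), shifted.mean())
```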

#### Mode

A rarely used measure of central tendency is the mode, which is the most likely or frequently occurring value. More commonly we simply use the term “mode” when describing whether a distribution has a single peak (unimodal) or two or more peaks (bimodal or multi-modal). In symmetric, unimodal distributions, the mode equals both the mean and the median. In unimodal, skewed distributions the mode is on the other side of the median from the mean. In multi-modal distributions there is either no unique highest mode, or the highest mode may well be unrepresentative of the central tendency.

The most common measure of central tendency is the mean. For skewed distribution or when there is concern about outliers, the median may be preferred.
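
In pandas, `Series.mode()` returns every value that ties for the highest frequency, so a sample with two equally frequent values yields two modes. The sketch below uses hypothetical values to show both cases.

```python
import pandas as pd

# Hypothetical samples, for illustration only
one_peak = pd.Series([1, 2, 2, 3])
two_peaks = pd.Series([1, 2, 2, 3, 3, 4])

single_mode = one_peak.mode()   # Series holding the one most frequent value
tied_modes = two_peaks.mode()   # Series holding all tied values, sorted

print(single_mode.tolist())
print(tied_modes.tolist())
```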



#### Spread

Several statistics are commonly used as a measure of the spread of a distribution, including variance, standard deviation, and interquartile range. Spread is an indicator of how far away from the center we are still likely to find data values.

The variance and standard deviation are two useful measures of spread. 

The variance is the mean of the squares of the individual deviations. 

The standard deviation is the square root of the variance. For Normally distributed data, approximately 95% of the values lie within 2 sd of the mean.
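
The definitions above can be sketched directly in pandas. One detail worth knowing: `Series.var()` and `Series.std()` divide by n - 1 by default (`ddof=1`), whereas "the mean of the squared deviations" corresponds to `ddof=0`. The sample is hypothetical.

```python
import pandas as pd

# Hypothetical sample, for illustration only
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance as the mean of squared deviations (divides by n)
manual_variance = ((values - values.mean()) ** 2).mean()

population_variance = values.var(ddof=0)  # same definition via pandas
sample_variance = values.var()            # pandas default: divides by n - 1
population_sd = values.std(ddof=0)        # square root of the variance

# Interquartile range: distance between the 75th and 25th percentiles
iqr = values.quantile(0.75) - values.quantile(0.25)

print(manual_variance, population_variance, sample_variance, population_sd, iqr)
```

`df.describe()` reports `std` with the default `ddof=1`, so its values are slightly larger than the divide-by-n version.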


In [None]:
df.describe()

Unnamed: 0,tmpd,dptp,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
count,6939.0,6938.0,2493.0,6698.0,6940.0,6940.0
mean,50.309339,40.341686,16.230958,33.895206,19.435513,25.231882
std,19.412801,18.48724,8.69775,17.967363,11.385984,7.991389
min,-16.0,-25.625,1.7,2.0,0.152778,6.158333
25%,35.0,27.0,9.7,21.5,10.072917,19.653819
50%,51.0,39.875,14.657143,30.278846,18.521802,24.555556
75%,67.0,55.75,20.6,42.0,27.000996,30.13904
max,92.0,78.25,61.5,365.0,66.5875,62.479984


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6940 entries, 0 to 6939
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   city        6940 non-null   object 
 1   tmpd        6939 non-null   float64
 2   dptp        6938 non-null   float64
 3   date        6940 non-null   object 
 4   pm25tmean2  2493 non-null   float64
 5   pm10tmean2  6698 non-null   float64
 6   o3tmean2    6940 non-null   float64
 7   no2tmean2   6940 non-null   float64
dtypes: float64(6), object(2)
memory usage: 433.9+ KB
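
The non-null counts reported by `info()` hint at missing values. A minimal sketch of counting them explicitly is shown below, on a hypothetical toy frame that reuses two column names from the Chicago data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries
toy = pd.DataFrame({
    "tmpd": [31.5, np.nan, 29.0, 32.0],
    "pm25tmean2": [np.nan, np.nan, np.nan, 23.56],
})

missing_counts = toy.isna().sum()     # number of missing values per column
missing_fraction = toy.isna().mean()  # share of missing values per column

print(missing_counts)
print(missing_fraction)
```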


In [None]:
df.head()

Unnamed: 0,city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
0,chic,31.5,31.5,1987-01-01,,34.0,4.25,19.988095
1,chic,33.0,29.875,1987-01-02,,,3.304348,23.190994
2,chic,33.0,27.375,1987-01-03,,34.166667,3.333333,23.815476
3,chic,29.0,28.625,1987-01-04,,47.0,4.375,30.434524
4,chic,32.0,28.875,1987-01-05,,,4.75,30.333333


In [None]:
df_subset = df.copy()  # independent copy; a plain "df_subset = df" would only alias the same object

In [None]:
df_subset

Unnamed: 0,city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
0,chic,31.5,31.500,1987-01-01,,34.000000,4.250000,19.988095
1,chic,33.0,29.875,1987-01-02,,,3.304348,23.190994
2,chic,33.0,27.375,1987-01-03,,34.166667,3.333333,23.815476
3,chic,29.0,28.625,1987-01-04,,47.000000,4.375000,30.434524
4,chic,32.0,28.875,1987-01-05,,,4.750000,30.333333
...,...,...,...,...,...,...,...,...
6935,chic,40.0,33.600,2005-12-27,23.560000,27.000000,4.468750,23.500000
6936,chic,37.0,34.500,2005-12-28,17.750000,27.500000,3.260417,19.285628
6937,chic,35.0,29.400,2005-12-29,7.450000,23.500000,6.794837,19.972222
6938,chic,36.0,31.000,2005-12-30,15.057143,19.200000,3.034420,22.805556


#### Select the first three columns


In [None]:
df_subset[['city','tmpd','dptp']] 

Unnamed: 0,city,tmpd,dptp
0,chic,31.5,31.500
1,chic,33.0,29.875
2,chic,33.0,27.375
3,chic,29.0,28.625
4,chic,32.0,28.875
...,...,...,...
6935,chic,40.0,33.600
6936,chic,37.0,34.500
6937,chic,35.0,29.400
6938,chic,36.0,31.000


In [None]:
# Extract the rows of the Chicago data frame where PM2.5 is greater than 30
# (a reasonably high level)
df_filtered_rows = df.loc[df['pm25tmean2'] > 30]

In [None]:
df_filtered_rows.head()

Unnamed: 0,city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
4034,chic,23.0,21.9,1998-01-17,38.1,32.461538,3.180556,25.3
4040,chic,28.0,25.8,1998-01-23,33.95,38.692308,1.75,29.376299
4137,chic,55.0,51.3,1998-04-30,39.4,34.0,10.786232,25.313095
4138,chic,59.0,53.7,1998-05-01,35.4,28.5,14.295125,31.429046
4139,chic,57.0,52.0,1998-05-02,33.3,35.0,20.662879,26.798611


In [None]:
df_filtered_rows['pm25tmean2'].describe() 

count    194.000000
mean      36.626268
std        5.742143
min       30.050000
25%       32.125000
50%       35.042857
75%       39.533125
max       61.500000
Name: pm25tmean2, dtype: float64

There are now only 194 rows in the data frame, and the distribution of
the pm25tmean2 values is as shown above.
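
Row filtering with `.loc` relies on a boolean mask. The sketch below, on a hypothetical toy frame, shows the mask itself and one consequence worth noting: comparisons against missing values evaluate to `False`, so rows with a missing PM2.5 reading are dropped by the filter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing PM2.5 reading
toy = pd.DataFrame({
    "pm25tmean2": [np.nan, 12.0, 35.5, 41.0],
    "tmpd": [31.5, 85.0, 78.0, 82.0],
})

mask = toy["pm25tmean2"] > 30  # boolean Series; NaN > 30 is False
high_pm = toy.loc[mask]        # keeps only the rows where the mask is True

print(mask.tolist())
print(high_pm)
```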

### Task: Extract the rows where PM2.5 is greater than 30 and temperature is greater than 80 degrees Fahrenheit.

In [None]:
df_filtered_rows = df.loc[(df['pm25tmean2'] >30) & (df['tmpd'] > 80)]

In [None]:
df_filtered_rows 

Unnamed: 0,city,tmpd,dptp,date,pm25tmean2,pm10tmean2,o3tmean2,no2tmean2
4252,chic,81.0,71.2,1998-08-23,39.6,59.0,45.863636,14.326389
4266,chic,81.0,70.4,1998-09-06,31.5,50.5,50.6625,20.3125
5314,chic,82.0,72.2,2001-07-20,32.3,58.5,33.003804,33.675
5326,chic,84.0,72.9,2001-08-01,43.7,81.5,45.177355,27.442391
5333,chic,85.0,72.6,2001-08-08,38.8375,70.0,37.980468,27.627433
5334,chic,84.0,72.6,2001-08-09,38.2,66.0,36.732452,26.467424
5649,chic,82.0,67.4,2002-06-20,33.0,80.5,47.426731,30.767029
5652,chic,82.0,63.5,2002-06-23,42.5,65.0,54.880435,30.03913
5667,chic,81.0,70.4,2002-07-08,33.1,64.0,45.349693,27.678571
5677,chic,82.0,66.2,2002-07-18,38.85,72.5,44.980455,26.069048


In [None]:
df_filtered_rows[['date','tmpd','pm25tmean2']] 

Unnamed: 0,date,tmpd,pm25tmean2
4252,1998-08-23,81.0,39.6
4266,1998-09-06,81.0,31.5
5314,2001-07-20,82.0,32.3
5326,2001-08-01,84.0,43.7
5333,2001-08-08,85.0,38.8375
5334,2001-08-09,84.0,38.2
5649,2002-06-20,82.0,33.0
5652,2002-06-23,82.0,42.5
5667,2002-07-08,81.0,33.1
5677,2002-07-18,82.0,38.85
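
When combining conditions, each comparison needs its own parentheses because `&` binds more tightly than `>` in Python. As a sketch on a hypothetical toy frame, the same filter can also be written with `DataFrame.query`, which some readers find easier to scan.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    "pm25tmean2": [np.nan, 12.0, 35.5, 41.0],
    "tmpd": [31.5, 85.0, 78.0, 82.0],
})

# Element-wise AND of two masks; the parentheses are required
both = toy.loc[(toy["pm25tmean2"] > 30) & (toy["tmpd"] > 80)]

# Equivalent filter expressed as a query string
via_query = toy.query("pm25tmean2 > 30 and tmpd > 80")

print(both)
```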


In [None]:
# assign() returns a new DataFrame instead of setting values on a slice of df, which avoids SettingWithCopyWarning
df_filtered_rows = df_filtered_rows.assign(date=pd.to_datetime(df_filtered_rows['date']))


In [None]:
df_filtered_rows.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17 entries, 4252 to 6789
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   city        17 non-null     object        
 1   tmpd        17 non-null     float64       
 2   dptp        17 non-null     float64       
 3   date        17 non-null     datetime64[ns]
 4   pm25tmean2  17 non-null     float64       
 5   pm10tmean2  17 non-null     float64       
 6   o3tmean2    17 non-null     float64       
 7   no2tmean2   17 non-null     float64       
dtypes: datetime64[ns](1), float64(6), object(1)
memory usage: 1.2+ KB
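
Assigning a column on a filtered frame can trigger pandas' `SettingWithCopyWarning`, because pandas cannot tell whether the filtered frame is a view of the original. One common way around this, sketched here on a hypothetical toy frame, is to take an explicit `.copy()` right after filtering.

```python
import pandas as pd

# Hypothetical toy frame with dates stored as strings
toy = pd.DataFrame({
    "date": ["2001-08-01", "1998-08-23"],
    "pm25tmean2": [43.7, 39.6],
})

# Explicit copy: later assignments affect only the subset
subset = toy.loc[toy["pm25tmean2"] > 40].copy()
subset["date"] = pd.to_datetime(subset["date"])

print(subset.dtypes)
print(toy.dtypes)  # the original frame keeps its string dates
```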


In [None]:
df_sorted = df_filtered_rows[['date','pm25tmean2']].sort_values(by='date')


In [None]:
df_sorted.head()

Unnamed: 0,date,pm25tmean2
4252,1998-08-23,39.6
4266,1998-09-06,31.5
5314,2001-07-20,32.3
5326,2001-08-01,43.7
5333,2001-08-08,38.8375


In [None]:
df_sorted.tail()

Unnamed: 0,date,pm25tmean2
6749,2005-06-24,31.857143
6752,2005-06-27,51.5375
6753,2005-06-28,31.2
6772,2005-07-17,32.7
6789,2005-08-03,37.9
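
`sort_values` sorts in ascending order by default; passing `ascending=False` reverses it, which is handy for putting the highest pollution days first. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame, deliberately out of order
toy = pd.DataFrame({
    "date": pd.to_datetime(["2005-06-27", "1998-08-23", "2001-08-01"]),
    "pm25tmean2": [51.5375, 39.6, 43.7],
})

by_date = toy.sort_values(by="date")                          # oldest first
by_level = toy.sort_values(by="pm25tmean2", ascending=False)  # highest PM2.5 first

print(by_date)
print(by_level)
```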


### Rename specific columns


In [None]:
df = df.rename(columns={'dptp': 'dewpoint', 'pm25tmean2': 'pm25'})

In [None]:
df.head()

Unnamed: 0,city,tmpd,dewpoint,date,pm25,pm10tmean2,o3tmean2,no2tmean2
0,chic,31.5,31.5,1987-01-01,,34.0,4.25,19.988095
1,chic,33.0,29.875,1987-01-02,,,3.304348,23.190994
2,chic,33.0,27.375,1987-01-03,,34.166667,3.333333,23.815476
3,chic,29.0,28.625,1987-01-04,,47.0,4.375,30.434524
4,chic,32.0,28.875,1987-01-05,,,4.75,30.333333


In [None]:
# we create a pm25detrend variable that subtracts the mean from the pm25 variable

In [None]:
df_new = df.assign(pm25detrend = df["pm25"] - df["pm25"].mean()) 

In [None]:

df_new.tail()

Unnamed: 0,city,tmpd,dewpoint,date,pm25,pm10tmean2,o3tmean2,no2tmean2,pm25detrend
6935,chic,40.0,33.6,2005-12-27,23.56,27.0,4.46875,23.5,7.329042
6936,chic,37.0,34.5,2005-12-28,17.75,27.5,3.260417,19.285628,1.519042
6937,chic,35.0,29.4,2005-12-29,7.45,23.5,6.794837,19.972222,-8.780958
6938,chic,36.0,31.0,2005-12-30,15.057143,19.2,3.03442,22.805556,-1.173815
6939,chic,35.0,30.1,2005-12-31,15.0,23.5,2.53125,13.25,-1.230958
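
Since the `pm25` column contains missing values, it helps to know how the subtraction behaves: `mean()` skips missing values by default, and the detrended column stays missing wherever `pm25` was missing. The sketch below illustrates this on a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing reading
toy = pd.DataFrame({"pm25": [np.nan, 10.0, 20.0, 30.0]})

pm25_mean = toy["pm25"].mean()  # NaN is skipped: (10 + 20 + 30) / 3
toy_new = toy.assign(pm25detrend=toy["pm25"] - pm25_mean)

print(pm25_mean)
print(toy_new)
```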


In [None]:
df['date'] = pd.to_datetime(df.date)



In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6940 entries, 0 to 6939
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   city        6940 non-null   object        
 1   tmpd        6939 non-null   float64       
 2   dewpoint    6938 non-null   float64       
 3   date        6940 non-null   datetime64[ns]
 4   pm25        2493 non-null   float64       
 5   pm10tmean2  6698 non-null   float64       
 6   o3tmean2    6940 non-null   float64       
 7   no2tmean2   6940 non-null   float64       
dtypes: datetime64[ns](1), float64(6), object(1)
memory usage: 433.9+ KB


In [None]:
df['year'] = df['date'].dt.year


In [None]:
df.head()

Unnamed: 0,city,tmpd,dewpoint,date,pm25,pm10tmean2,o3tmean2,no2tmean2,year
0,chic,31.5,31.5,1987-01-01,,34.0,4.25,19.988095,1987
1,chic,33.0,29.875,1987-01-02,,,3.304348,23.190994,1987
2,chic,33.0,27.375,1987-01-03,,34.166667,3.333333,23.815476,1987
3,chic,29.0,28.625,1987-01-04,,47.0,4.375,30.434524,1987
4,chic,32.0,28.875,1987-01-05,,,4.75,30.333333,1987


In [None]:
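# Apply a transformation to each row of the data frame.
# Note: tmpd is already in degrees Fahrenheit, so the Celsius-to-Fahrenheit formula below only demonstrates apply().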
df.apply(lambda x: x.tmpd*1.8 + 32, axis=1)

0        88.7
1        91.4
2        91.4
3        84.2
4        89.6
        ...  
6935    104.0
6936     98.6
6937     95.0
6938     96.8
6939     95.0
Length: 6940, dtype: float64
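
For simple arithmetic like this, a row-wise `apply` is not strictly needed: the same expression can be applied to the whole column at once, which is usually faster on large frames. A sketch comparing the two on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({"tmpd": [31.5, 33.0, 29.0]})

# Row-wise: the lambda is called once per row
row_wise = toy.apply(lambda row: row.tmpd * 1.8 + 32, axis=1)

# Vectorized: the arithmetic is applied to the entire column
vectorized = toy["tmpd"] * 1.8 + 32

print(row_wise.tolist())
print(vectorized.tolist())
```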

In [None]:
df

Unnamed: 0,city,tmpd,dewpoint,date,pm25,pm10tmean2,o3tmean2,no2tmean2,year
0,chic,31.5,31.500,1987-01-01,,34.000000,4.250000,19.988095,1987
1,chic,33.0,29.875,1987-01-02,,,3.304348,23.190994,1987
2,chic,33.0,27.375,1987-01-03,,34.166667,3.333333,23.815476,1987
3,chic,29.0,28.625,1987-01-04,,47.000000,4.375000,30.434524,1987
4,chic,32.0,28.875,1987-01-05,,,4.750000,30.333333,1987
...,...,...,...,...,...,...,...,...,...
6935,chic,40.0,33.600,2005-12-27,23.560000,27.000000,4.468750,23.500000,2005
6936,chic,37.0,34.500,2005-12-28,17.750000,27.500000,3.260417,19.285628,2005
6937,chic,35.0,29.400,2005-12-29,7.450000,23.500000,6.794837,19.972222,2005
6938,chic,36.0,31.000,2005-12-30,15.057143,19.200000,3.034420,22.805556,2005


In [None]:
df_new = df.groupby("year", sort=True)  # GroupBy object; head() below returns the first 5 rows of each year

In [None]:
df_new.head()

Unnamed: 0,city,tmpd,dewpoint,date,pm25,pm10tmean2,o3tmean2,no2tmean2,year
0,chic,31.5,31.500,1987-01-01,,34.000000,4.250000,19.988095,1987
1,chic,33.0,29.875,1987-01-02,,,3.304348,23.190994,1987
2,chic,33.0,27.375,1987-01-03,,34.166667,3.333333,23.815476,1987
3,chic,29.0,28.625,1987-01-04,,47.000000,4.375000,30.434524,1987
4,chic,32.0,28.875,1987-01-05,,,4.750000,30.333333,1987
...,...,...,...,...,...,...,...,...,...
6575,chic,33.0,24.800,2005-01-01,10.5250,16.500000,9.208333,14.152778,2005
6576,chic,44.0,39.600,2005-01-02,,16.000000,6.447917,16.069444,2005
6577,chic,34.0,31.500,2005-01-03,17.5000,11.500000,12.687500,18.625845,2005
6578,chic,33.0,29.000,2005-01-04,8.8375,11.600000,17.548460,19.315217,2005


In [None]:
# Maximum temperature per year; the column created above is named 'year' (lowercase)
df.pivot_table(index='year', values='tmpd', aggfunc='max')
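
A grouped object becomes useful once an aggregation is applied to it. The sketch below, on a hypothetical toy frame, computes the maximum temperature per year with `groupby` and shows that `pivot_table` gives the same numbers.

```python
import pandas as pd

# Hypothetical toy frame with two years of readings
toy = pd.DataFrame({
    "year": [1987, 1987, 2005, 2005],
    "tmpd": [31.5, 33.0, 40.0, 37.0],
})

# One value per group: the maximum tmpd within each year
yearly_max = toy.groupby("year")["tmpd"].max()

# Same aggregation expressed as a pivot table
pivot = toy.pivot_table(index="year", values="tmpd", aggfunc="max")

print(yearly_max)
print(pivot)
```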