# Introduction

This week we begin our exploration of one of the main Python packages for data analysis: pandas. Pandas (the name is derived from "panel data") is a package built for analyzing tabular data, that is, data arranged in rows and columns. It has efficient methods for loading, storing, modifying, and analyzing this type of data and is one of the key pieces of the Python analytics toolkit. We will return to pandas several times throughout the semester to perform increasingly complex operations on our datasets.

The readings for this week are: 
* [Sections 3.1-3.3 of the Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)
* You might also want to look at [this notebook](https://nbviewer.jupyter.org/github/pydata/pydata-book/blob/2nd-edition/ch05.ipynb) for additional examples

In [2]:
import pandas as pd

Since pandas is a separate module, we need to start by importing it. The standard shorthand for pandas is `pd`, so we will use that alias throughout this module. Pandas provides a couple of data structures that we will use quite a bit. The first is the Series, which we will use to represent columns of data. A Series is similar to the list and tuple structures that we encountered in the previous notebook, but it also supports a large collection of methods for processing the data it contains. Descriptions of these methods can be found in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).

In [16]:
my_first_column = pd.Series([1,10,3,5,9,3,2,2,1])

In [17]:
my_first_column

0     1
1    10
2     3
3     5
4     9
5     3
6     2
7     2
8     1
dtype: int64

The first thing to notice is that pandas has automatically created an index for us. This will be a very useful feature when we are dealing with larger, more complex datasets. The output also tells us what type of data is stored in the Series. To access individual elements in the Series, we can use square brackets as we did with lists and tuples, although here the brackets look up index labels rather than positions, so a negative index such as `-1` fails. Alternatively, we can select by index label with the `.loc` attribute or by integer position with the `.iloc` attribute.

In [4]:
my_first_column[3]

5

In [5]:
my_first_column[-1]

KeyError: -1

In [18]:
my_first_column.loc[3]

5

In [19]:
my_first_column.iloc[-1]

1
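The difference between label-based and position-based access is easiest to see when the index labels are not in order. The following is a minimal sketch using a small made-up Series (the values and index are assumptions chosen for illustration, not part of the course data):

```python
import pandas as pd

# Hypothetical Series whose integer labels are deliberately out of order
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]       # the element whose index label is 0
by_position = s.iloc[0]   # the element in the first position
last_value = s.iloc[-1]   # negative positions count from the end

print(by_label, by_position, last_value)
```

Here `.loc[0]` and `.iloc[0]` return different elements because the label `0` sits in the second position.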

In [70]:
my_first_column.index = ['a','b','c','d','e','f','g','h','i']

In [71]:
my_first_column

a     1
b    10
c     3
d     5
e     9
f     3
g     2
h     2
i     1
dtype: int64

In [72]:
my_first_column.loc['d']

5

In [73]:
my_second_column = pd.Series([5,44,2,1,0,0,3,4,6])

In [74]:
my_second_column.max()

44

In [75]:
my_second_column.sum()

65

In [76]:
my_second_column.mean()

7.222222222222222
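The calls above are only a few of the methods a Series provides. As a brief sketch, here are some others applied to the same values as `my_second_column` (the variable names below are chosen for illustration):

```python
import pandas as pd

values = pd.Series([5, 44, 2, 1, 0, 0, 3, 4, 6])

smallest = values.min()          # smallest entry
median = values.median()         # middle value of the sorted data
counts = values.value_counts()   # how often each distinct value occurs
ordered = values.sort_values()   # sorted copy; the original is unchanged

print(smallest, median)
print(ordered.tolist())
```

Note that `sort_values` returns a new Series and keeps the original index labels attached to each value.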

We can group columns of data together into larger collections called dataframes. These will be our standard structure for interacting with all sorts of data. For now, we will think of a dataframe as a collection of labelled columns, where each column is represented by a Series. This matches the standard presentation of data as observations (rows) and variables (columns). As with Series, dataframes come with many useful methods that will make our lives easier, documented [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

In [77]:
# Data source: https://www.collegetuitioncompare.com/
tuition_df = pd.DataFrame({"WSU": [27302,27697,28206,27825,26899,27249,27484,27991,28520],
                           "UI": [19700,19598,19630,20294,20640,21300,21350,21820,21820],
                           "year": [2012,2013,2014,2015,2016,2017,2018,2019,2020]})

In [78]:
tuition_df

Unnamed: 0,WSU,UI,year
0,27302,19700,2012
1,27697,19598,2013
2,28206,19630,2014
3,27825,20294,2015
4,26899,20640,2016
5,27249,21300,2017
6,27484,21350,2018
7,27991,21820,2019
8,28520,21820,2020


In [79]:
# Same data, but with the years used as the row index instead of a column
tuition_df = pd.DataFrame({"WSU": [27302,27697,28206,27825,26899,27249,27484,27991,28520],
                           "UI": [19700,19598,19630,20294,20640,21300,21350,21820,21820]},
                          index=[2012,2013,2014,2015,2016,2017,2018,2019,2020])

In [80]:
tuition_df

Unnamed: 0,WSU,UI
2012,27302,19700
2013,27697,19598
2014,28206,19630
2015,27825,20294
2016,26899,20640
2017,27249,21300
2018,27484,21350
2019,27991,21820
2020,28520,21820


In [81]:
tuition_df.columns

Index(['WSU', 'UI'], dtype='object')

In [82]:
column_dictionary = {
    'Column 1': my_first_column,
    'Column 2': my_second_column
}

In [83]:
my_second_df = pd.DataFrame(column_dictionary)

In [84]:
my_second_df

Unnamed: 0,Column 1,Column 2
a,1.0,
b,10.0,
c,3.0,
d,5.0,
e,9.0,
f,3.0,
g,2.0,
h,2.0,
i,1.0,
0,,5.0
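The missing values in the table above appear because pandas lines up the Series by index label when building a dataframe: `my_first_column` is labelled `a` through `i`, while `my_second_column` still has the default integer labels, so no row has a value in both columns. The sketch below reproduces the effect on a smaller made-up example and shows one possible way to pair the values by position instead, by copying the labels from one Series to the other (the variable names are assumptions for illustration):

```python
import pandas as pd

first = pd.Series([1, 10, 3], index=["a", "b", "c"])
second = pd.Series([5, 44, 2])  # default index 0, 1, 2

# The labels do not match, so every label gets its own row and gaps become NaN
misaligned = pd.DataFrame({"Column 1": first, "Column 2": second})

# One option: give the second Series the same labels before combining
second_relabelled = second.copy()
second_relabelled.index = first.index
aligned = pd.DataFrame({"Column 1": first, "Column 2": second_relabelled})

print(misaligned.shape, aligned.shape)
```

Whether relabelling is appropriate depends on the data; it only makes sense when the two columns really do describe the same observations in the same order.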


In [85]:
tuition_df.max()

WSU    28520
UI     21820
dtype: int64

In [86]:
tuition_df.head()

Unnamed: 0,WSU,UI
2012,27302,19700
2013,27697,19598
2014,28206,19630
2015,27825,20294
2016,26899,20640


In [87]:
tuition_df.tail()

Unnamed: 0,WSU,UI
2016,26899,20640
2017,27249,21300
2018,27484,21350
2019,27991,21820
2020,28520,21820


In [88]:
tuition_df.describe()

Unnamed: 0,WSU,UI
count,9.0,9.0
mean,27685.888889,20683.555556
std,509.536663,923.079508
min,26899.0,19598.0
25%,27302.0,19700.0
50%,27697.0,20640.0
75%,27991.0,21350.0
max,28520.0,21820.0


Accessing an individual element of a dataframe requires us to specify both a row and a column, but we will sometimes want to access full rows or full columns as well.

In [89]:
my_second_df.loc[0]

Column 1    NaN
Column 2    5.0
Name: 0, dtype: float64

In [90]:
my_second_df.loc['a']

Column 1    1.0
Column 2    NaN
Name: a, dtype: float64

In [91]:
my_second_df.iloc[12]

Column 1    NaN
Column 2    1.0
Name: 3, dtype: float64

In [92]:
tuition_df['WSU']

2012    27302
2013    27697
2014    28206
2015    27825
2016    26899
2017    27249
2018    27484
2019    27991
2020    28520
Name: WSU, dtype: int64

In [93]:
tuition_df['WSU'][2012]

27302

In [96]:
tuition_df.loc[2012,"WSU"]

27302
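The access patterns shown above can be collected in one place. This is a minimal sketch on a three-row excerpt of the tuition data (the variable names are chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"WSU": [27302, 27697, 28206], "UI": [19700, 19598, 19630]},
    index=[2012, 2013, 2014],
)

row = df.loc[2013]                   # full row by index label, as a Series
column = df["UI"]                    # full column, as a Series
cell = df.loc[2013, "UI"]            # single value by row label and column name
same_cell = df.iloc[1, 1]            # the same value by integer position
subset = df.loc[2012:2013, ["WSU"]]  # label slices include both endpoints

print(cell, same_cell)
print(subset)
```

Unlike list slicing, a `.loc` slice such as `2012:2013` includes the final label.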

If we had to create all of our dataframes by hand, that would be remarkably inefficient. Instead, we will usually load complete datasets stored as .csv files. These can either be stored locally on our computer or downloaded directly from the internet. If we are loading data locally, we need to be careful to specify the right directory.

In [97]:
HWAS = pd.read_csv("./Height_Weight_Age_Sex.csv")

In [98]:
HWAS.head()

Unnamed: 0,height,weight,age,male
0,151.765,47.825606,63.0,1
1,139.7,36.485807,63.0,0
2,136.525,31.864838,65.0,0
3,156.845,53.041915,41.0,1
4,145.415,41.276872,51.0,0


In [103]:
HWAS.describe()

Unnamed: 0,height,weight,age,male
count,544.0,544.0,544.0,544.0
mean,138.263596,35.610618,29.344393,0.472426
std,27.602448,14.719178,20.746888,0.499699
min,53.975,4.252425,0.0,0.0
25%,125.095,22.007717,12.0,0.0
50%,148.59,40.057844,27.0,0.0
75%,157.48,47.209005,43.0,1.0
max,179.07,62.992589,88.0,1.0


In [114]:
HWAS[HWAS['male']==1]

Unnamed: 0,height,weight,age,male
0,151.7650,47.825606,63.0,1
3,156.8450,53.041915,41.0,1
5,163.8300,62.992589,35.0,1
7,168.9100,55.479971,27.0,1
9,165.1000,54.487739,54.0,1
11,151.1300,41.220173,66.0,1
15,163.1950,48.562694,36.0,1
16,157.4800,42.325803,44.0,1
18,121.9200,19.617854,12.0,1
21,161.2900,48.987936,39.0,1


In [116]:
HWAS[HWAS['male']==1]["age"].mean()

29.473346303501945

In [117]:
HWAS[HWAS['male']==0]["age"].mean()

29.22891986062718
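Filtering with a boolean mask and then averaging, as in the two cells above, can also be written as a single grouped computation. The sketch below uses a tiny stand-in dataframe built from the first five rows of the table shown earlier, since the full file is not reproduced here; `groupby` is a swapped-in alternative rather than the approach used in the cells above:

```python
import pandas as pd

# Miniature stand-in with the same column names as HWAS (first five rows only)
people = pd.DataFrame({
    "age": [63.0, 63.0, 65.0, 41.0, 51.0],
    "male": [1, 0, 0, 1, 0],
})

# Mask approach: keep the rows where male == 1, then average the ages
male_mean = people[people["male"] == 1]["age"].mean()

# Grouped approach: one mean per distinct value of 'male'
mean_age_by_sex = people.groupby("male")["age"].mean()

print(male_mean)
print(mean_age_by_sex)
```

The grouped version returns a Series indexed by the values of `male`, so both averages are available from one call.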

In [106]:
cost_of_living = pd.read_csv("./COL.csv")

FileNotFoundError: File b'./COL.csv' does not exist

In [105]:
cost_of_living = pd.read_csv("./Data/COL.csv")

In [113]:
cost_of_living.tail()

Unnamed: 0,City,Cappuccino,Cinema,Wine,Gasoline,Avg Rent,Avg Disposable Income
211,Davao,0.79,1.9,3.17,0.84,554.18,158.34
212,Karachi,1.0,3.27,5.11,0.67,197.78,139.6
213,Lahore,1.23,3.27,6.54,0.66,206.08,132.95
214,Addis Ababa,0.46,2.29,4.18,0.72,653.77,124.22
215,Indore,0.91,2.23,6.03,0.84,205.15,120.68
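A `FileNotFoundError` like the one above usually means the path is being interpreted relative to a different directory than expected. One possible way to make this easier to diagnose is to check the path before reading. The helper below, `load_csv_checked`, is a hypothetical name introduced for this sketch and is not part of pandas; the demonstration writes a small temporary file so that the example is self-contained:

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_checked(path_string):
    """Read a CSV with pandas, reporting the working directory if it is missing."""
    path = Path(path_string)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; current working directory is {Path.cwd()}"
        )
    return pd.read_csv(path)


# Demonstration with a temporary file (contents made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("City,Cappuccino\nDavao,0.79\n")
    demo = load_csv_checked(sample)

print(demo)
```

Printing `Path.cwd()` in the error message shows where Python is actually looking, which often reveals a missing subfolder such as `Data/`.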


In [107]:
GME = pd.read_csv("http://math.wsu.edu/faculty/ddeford/GME_Stock.csv")

In [108]:
GME

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2/4/2020,4.030000,4.250000,3.970000,4.070000,4.070000,3563100
1,2/5/2020,4.150000,4.410000,4.140000,4.180000,4.180000,2641700
2,2/6/2020,4.200000,4.300000,4.140000,4.140000,4.140000,1510300
3,2/7/2020,4.110000,4.130000,3.770000,3.810000,3.810000,2742300
4,2/10/2020,3.850000,4.100000,3.740000,3.940000,3.940000,2777000
5,2/11/2020,3.980000,4.240000,3.950000,4.020000,4.020000,3415000
6,2/12/2020,4.130000,4.510000,4.070000,4.190000,4.190000,4820600
7,2/13/2020,4.120000,4.260000,4.070000,4.110000,4.110000,2081700
8,2/14/2020,4.110000,4.190000,4.020000,4.020000,4.020000,1582700
9,2/18/2020,4.010000,4.080000,3.960000,4.060000,4.060000,1467600


In [109]:
GME.index = GME["Date"]

In [111]:
GME.head()

Date,Date,Open,High,Low,Close,Adj Close,Volume
2/4/2020,2/4/2020,4.03,4.25,3.97,4.07,4.07,3563100
2/5/2020,2/5/2020,4.15,4.41,4.14,4.18,4.18,2641700
2/6/2020,2/6/2020,4.2,4.3,4.14,4.14,4.14,1510300
2/7/2020,2/7/2020,4.11,4.13,3.77,3.81,3.81,2742300
2/10/2020,2/10/2020,3.85,4.1,3.74,3.94,3.94,2777000
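In the table above the dates are stored as text and appear twice, once as the index and once as a column. A possible alternative, sketched below on three rows retyped from the output, is to convert the dates to timestamps and move the column into the index with `set_index`. The month/day/year format is an assumption inferred from the values shown:

```python
import pandas as pd

# First rows of the GME table, retyped by hand for illustration
gme_sample = pd.DataFrame({
    "Date": ["2/4/2020", "2/5/2020", "2/6/2020"],
    "Open": [4.03, 4.15, 4.20],
})

# Convert the text dates to timestamps (month/day/year format assumed)
gme_sample["Date"] = pd.to_datetime(gme_sample["Date"], format="%m/%d/%Y")

# set_index moves the column into the index instead of duplicating it
gme_sample = gme_sample.set_index("Date")

february_fifth = gme_sample.loc[pd.Timestamp("2020-02-05"), "Open"]
print(february_fifth)
```

With a timestamp index, rows sort chronologically rather than alphabetically, which matters once the data spans more than one year.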


In [112]:
GME["Open"]

Date
2/4/2020        4.030000
2/5/2020        4.150000
2/6/2020        4.200000
2/7/2020        4.110000
2/10/2020       3.850000
2/11/2020       3.980000
2/12/2020       4.130000
2/13/2020       4.120000
2/14/2020       4.110000
2/18/2020       4.010000
2/19/2020       4.060000
2/20/2020       4.160000
2/21/2020       4.120000
2/24/2020       3.900000
2/25/2020       3.770000
2/26/2020       3.580000
2/27/2020       3.230000
2/28/2020       3.340000
3/2/2020        3.600000
3/3/2020        3.880000
3/4/2020        3.710000
3/5/2020        3.700000
3/6/2020        3.840000
3/9/2020        3.590000
3/10/2020       3.940000
3/11/2020       4.140000
3/12/2020       3.700000
3/13/2020       4.130000
3/16/2020       3.930000
3/17/2020       4.400000
                 ...    
12/21/2020     15.810000
12/22/2020     16.219999
12/23/2020     20.170000
12/24/2020     21.010000
12/28/2020     21.309999
12/29/2020     20.820000
12/30/2020     19.379999
12/31/2020     19.250000
1/4/2021       19.00