# Intro to data analysis using Python

Python's built-in data structures (lists, dictionaries) are simple, but they are not designed for efficient numerical work on large datasets. The libraries <code>NumPy</code> and <code>pandas</code> provide data structures that are better suited to data analysis.

## Series
The first data structure is the Series: a one-dimensional, labeled array that can hold any data type.
You can create a Series in a few different ways.

### Creating Series

In [129]:
import numpy as np
import pandas as pd # This means we can refer to the pandas module as pd
my_list = [7, 8.5, 5, 2, 9]
series1 = pd.Series(my_list) # Creating a Series by passing in a Python list
series1

0    7.0
1    8.5
2    5.0
3    2.0
4    9.0
dtype: float64

Notice that the Series is printed with the index labels in the left column and the data values in the right column. Because the list mixes integers with a float, pandas stores every value as float64. We can access individual elements or slices of the Series using the index values in the left column.

In [130]:
series1[0]

7.0

In [131]:
series1[2:] # Slice from index 2 to the end

2    5.0
3    2.0
4    9.0
dtype: float64
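With the default integer index, positions and labels happen to coincide. The difference can be made explicit with the `.iloc` (position) and `.loc` (label) accessors. A minimal sketch, rebuilding the same Series as above:

```python
import pandas as pd

series1 = pd.Series([7, 8.5, 5, 2, 9])

# .iloc selects by integer position; its slices exclude the stop position
first_value = series1.iloc[0]
tail = series1.iloc[2:]

# .loc selects by index label; its slices include the stop label
by_label = series1.loc[2:3]

print(first_value)
print(tail.tolist())
print(by_label.tolist())
```

Note that `series1.loc[2:3]` returns two elements, while `series1.iloc[2:3]` would return only one.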

You can also create a Series whose values are labeled by name rather than by integer position.

In [132]:
quiz_series = pd.Series(my_list, index = ["Mary", "John", "Richard", "Bob", "Kate"])
quiz_series

Mary       7.0
John       8.5
Richard    5.0
Bob        2.0
Kate       9.0
dtype: float64

In [133]:
quiz_series["Mary"]

7.0

In [134]:
quiz_series[["Mary", "Bob", "John"]]

Mary    7.0
Bob     2.0
John    8.5
dtype: float64

In [135]:
velocities = pd.Series({"Mahomes": 60, "Kizer": 62, "Kaaya": 53, "Trubisky": 55, "Watson": 49}) # creating a Series with a dict
velocities

Kaaya       53
Kizer       62
Mahomes     60
Trubisky    55
Watson      49
dtype: int64
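When a Series is built from a dict, the keys become the index labels. The sketch below shows a few ways to inspect those labels. One caveat: recent pandas versions keep the dict's insertion order, whereas the output above is sorted alphabetically, so calling `sort_index()` is the safest way to get the alphabetical ordering explicitly.

```python
import pandas as pd

velocities = pd.Series(
    {"Mahomes": 60, "Kizer": 62, "Kaaya": 53, "Trubisky": 55, "Watson": 49}
)

labels = list(velocities.index)        # the dict keys
values = velocities.to_numpy()         # the underlying NumPy array
has_kizer = "Kizer" in velocities      # membership tests the index labels, not the values
sorted_by_label = velocities.sort_index()  # alphabetical by label

print(labels)
print(values)
print(has_kizer)
print(sorted_by_label)
```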

### Manipulating Series
You can manipulate the data in a Series in a variety of ways. Arithmetic and comparison operators are applied element-wise, so the result is a new Series with one value per original element.

In [136]:
velocities + 10

Kaaya       63
Kizer       72
Mahomes     70
Trubisky    65
Watson      59
dtype: int64

In [137]:
velocities > 50

Kaaya        True
Kizer        True
Mahomes      True
Trubisky     True
Watson      False
dtype: bool
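The Boolean Series produced by a comparison can itself be used as an index, which keeps only the elements where the condition is true. A short sketch using the same data:

```python
import pandas as pd

velocities = pd.Series(
    {"Mahomes": 60, "Kizer": 62, "Kaaya": 53, "Trubisky": 55, "Watson": 49}
)

mask = velocities > 50          # Boolean Series, one True/False per element
fast = velocities[mask]         # keep only the elements where mask is True
n_fast = int(mask.sum())        # True counts as 1, so the sum is a count

print(fast)
print(n_fast)
```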

A Series is an object with methods you can call to get information about the data or to transform it.

In [138]:
velocities.max()

62

In [139]:
velocities.median()

55.0

In [140]:
velocities.describe() # Some basic summary statistics

count     5.000000
mean     55.800000
std       5.263079
min      49.000000
25%      53.000000
50%      55.000000
75%      60.000000
max      62.000000
dtype: float64
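The `std` row reported by `describe()` is the sample standard deviation, which divides by n - 1 rather than n. The sketch below recomputes it by hand to show where 5.263079 comes from, and contrasts it with the population version (`ddof=0`).

```python
import numpy as np
import pandas as pd

velocities = pd.Series(
    {"Mahomes": 60, "Kizer": 62, "Kaaya": 53, "Trubisky": 55, "Watson": 49}
)

sample_std = velocities.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = velocities.std(ddof=0)  # divide by n instead

# Manual version of the sample standard deviation
squared_deviations = (velocities - velocities.mean()) ** 2
manual_std = np.sqrt(squared_deviations.sum() / (len(velocities) - 1))

print(sample_std, manual_std, population_std)
```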

## Data Frames
The two-dimensional data structure in pandas is the DataFrame. You can think of a DataFrame as a spreadsheet of data, or as a dictionary of Series objects that share an index. Usually, the columns in a dataset represent the variables and the rows represent the observations.

In [141]:
quiz2_series = pd.Series([7, 4, 6.5, 8.5], index = ["Mary", "John", "Richard", "Kate"]) # A new Series - Bob has been omitted
quiz_scores = pd.DataFrame({"quiz1": quiz_series, "quiz2": quiz2_series}) # Passing in a dict of Series
quiz_scores

|         | quiz1 | quiz2 |
|---------|-------|-------|
| Bob     | 2.0   | NaN   |
| John    | 8.5   | 4.0   |
| Kate    | 9.0   | 8.5   |
| Mary    | 7.0   | 7.0   |
| Richard | 5.0   | 6.5   |


The NaN (Not a Number) value for Bob's quiz 2 indicates that there is a missing value there. It is equivalent to a null value.
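Missing values can be detected and handled explicitly. The sketch below rebuilds the same DataFrame and shows `isna()`, the default NaN-skipping behavior of `mean()`, and `dropna()`.

```python
import pandas as pd

quiz_series = pd.Series(
    [7, 8.5, 5, 2, 9], index=["Mary", "John", "Richard", "Bob", "Kate"]
)
quiz2_series = pd.Series(
    [7, 4, 6.5, 8.5], index=["Mary", "John", "Richard", "Kate"]
)
quiz_scores = pd.DataFrame({"quiz1": quiz_series, "quiz2": quiz2_series})

missing = quiz_scores["quiz2"].isna()     # True where the value is NaN
n_missing = int(missing.sum())
quiz2_mean = quiz_scores["quiz2"].mean()  # NaN values are skipped by default
complete = quiz_scores.dropna()           # keep only rows with no missing values

print(missing)
print(n_missing, quiz2_mean)
print(complete)
```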

You can access values in the DataFrame using the names of the columns, which will return a Series. You can then access values within those Series using the row names.

In [142]:
quiz_scores["quiz1"] # Returns a series

Bob        2.0
John       8.5
Kate       9.0
Mary       7.0
Richard    5.0
Name: quiz1, dtype: float64

In [143]:
quiz_scores["quiz2"]["Mary"] # Returns a value

7.0
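The same lookups can also be written in a single step with `.loc[row, column]`, which avoids creating the intermediate Series. A small sketch on the same data:

```python
import pandas as pd

quiz_series = pd.Series(
    [7, 8.5, 5, 2, 9], index=["Mary", "John", "Richard", "Bob", "Kate"]
)
quiz2_series = pd.Series(
    [7, 4, 6.5, 8.5], index=["Mary", "John", "Richard", "Kate"]
)
quiz_scores = pd.DataFrame({"quiz1": quiz_series, "quiz2": quiz2_series})

mary_quiz2 = quiz_scores.loc["Mary", "quiz2"]            # a single value
kate_row = quiz_scores.loc["Kate"]                       # one row, returned as a Series
subset = quiz_scores.loc[["Mary", "Kate"], ["quiz1"]]    # lists of labels return a DataFrame

print(mary_quiz2)
print(kate_row)
print(subset)
```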

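We can insert or change values in a DataFrame by assigning to a row label and column label with <code>.loc</code>. Let's say we found out Richard cheated on his first quiz and we want to set his score to 0. Let's also fill in Bob's second quiz score.

In [144]:
quiz_scores.loc["Richard", "quiz1"] = 0 # .loc[row, column] assigns in place; chained indexing such as quiz_scores["quiz1"]["Richard"] = 0 may modify a temporary copy instead
quiz_scores.loc["Bob", "quiz2"] = 7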
quiz_scores

|         | quiz1 | quiz2 |
|---------|-------|-------|
| Bob     | 2.0   | 7.0   |
| John    | 8.5   | 4.0   |
| Kate    | 9.0   | 8.5   |
| Mary    | 7.0   | 7.0   |
| Richard | 0.0   | 6.5   |
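Assigning to a column name that does not exist yet adds a new column. A minimal sketch, rebuilding the edited table by hand; the column name `total` is an illustrative choice and not part of the original notebook:

```python
import pandas as pd

quiz_scores = pd.DataFrame(
    {
        "quiz1": [2.0, 8.5, 9.0, 7.0, 0.0],
        "quiz2": [7.0, 4.0, 8.5, 7.0, 6.5],
    },
    index=["Bob", "John", "Kate", "Mary", "Richard"],
)

# Column arithmetic is element-wise and aligned on the row labels
quiz_scores["total"] = quiz_scores["quiz1"] + quiz_scores["quiz2"]

print(quiz_scores)
```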


We can also select rows of a DataFrame with a Boolean condition.

In [145]:
q1_mean = quiz_scores["quiz1"].mean() # Find the mean of the quiz1 Series
quiz_scores[quiz_scores["quiz1"] > q1_mean] # Get a slice of the DataFrame such that all observations have quiz1 scores above the quiz1 average

|      | quiz1 | quiz2 |
|------|-------|-------|
| John | 8.5   | 4.0   |
| Kate | 9.0   | 8.5   |
| Mary | 7.0   | 7.0   |
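Several conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on the edited quiz table, where the threshold of 7 for quiz2 is an arbitrary illustrative choice:

```python
import pandas as pd

quiz_scores = pd.DataFrame(
    {
        "quiz1": [2.0, 8.5, 9.0, 7.0, 0.0],
        "quiz2": [7.0, 4.0, 8.5, 7.0, 6.5],
    },
    index=["Bob", "John", "Kate", "Mary", "Richard"],
)

q1_mean = quiz_scores["quiz1"].mean()

# Rows that satisfy both conditions
both = quiz_scores[(quiz_scores["quiz1"] > q1_mean) & (quiz_scores["quiz2"] >= 7)]

# Rows that satisfy at least one condition
either = quiz_scores[(quiz_scores["quiz1"] > q1_mean) | (quiz_scores["quiz2"] >= 7)]

print(both)
print(either)
```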


## Some real data

Most of the time we will want to import an existing dataset into Python rather than typing it in by hand as we have done so far. A very widely used format is CSV (comma-separated values). The <code>read_csv()</code> function takes a path to a CSV file and reads it into a pandas DataFrame. Here we will investigate some demographic and employment data. The meanings of variables and values can be found in the [Census codebook](https://usa.ipums.org/usa/resources/codebooks/2000_PUMS_codebook.pdf).

In [146]:
census_df = pd.read_csv("5pCensus2000.csv") # A 5% sample of U.S. Census data from the year 2000
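If the census file is not at hand, `read_csv()` can be tried out on CSV text held in memory. In the sketch below, an `io.StringIO` object stands in for the file, and the three rows and four columns are a small excerpt copied from the output shown in this section, not the full dataset.

```python
import io

import pandas as pd

# A few columns and rows in CSV form; StringIO makes the text behave like a file
csv_text = """SEX,AGE,INCTOT,FTOTINC
2,66,18000,54700
1,40,36700,54700
1,51,54000,56900
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)
print(sample_df.dtypes)   # column types are inferred from the text
```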

In [147]:
census_df.head() # View the first few rows of observations

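|   | YEAR | DATANUM | SERIAL | HHWT | GQ | PERNUM | PERWT | SEX | AGE | MARST | EMPSTAT | EMPSTATD | LABFORCE | OCC1990 | WKSWORK1 | INCTOT | FTOTINC |
|---|------|---------|--------|------|----|--------|-------|-----|-----|-------|---------|----------|----------|---------|----------|--------|---------|
| 0 | 2000 | 3 | 1 | 600 | 1 | 1 | 618 | 2 | 66 | 5 | 1 | 10 | 2 | 95 | 46 | 18000 | 54700 |
| 1 | 2000 | 3 | 1 | 600 | 1 | 2 | 684 | 1 | 40 | 6 | 1 | 10 | 2 | 507 | 50 | 36700 | 54700 |
| 2 | 2000 | 3 | 2 | 600 | 1 | 1 | 618 | 1 | 51 | 1 | 1 | 10 | 2 | 637 | 52 | 54000 | 56900 |
| 3 | 2000 | 3 | 2 | 600 | 1 | 2 | 609 | 2 | 48 | 1 | 3 | 30 | 1 | 376 | 1 | 900 | 56900 |
| 4 | 2000 | 3 | 2 | 600 | 1 | 3 | 621 | 1 | 19 | 6 | 1 | 10 | 2 | 885 | 16 | 2000 | 56900 |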


In [148]:
census_df.describe() # Summary statistics for each variable

|       | YEAR | DATANUM | SERIAL | HHWT | GQ | PERNUM | PERWT | SEX | AGE | MARST | EMPSTAT | EMPSTATD | LABFORCE | OCC1990 | WKSWORK1 | INCTOT | FTOTINC |
|-------|------|---------|--------|------|----|--------|-------|-----|-----|-------|---------|----------|----------|---------|----------|--------|---------|
| count | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 | 371618.0 |
| mean | 2000.0 | 3.0 | 78898.482759 | 718.124391 | 1.001383 | 2.148871 | 736.324842 | 1.517887 | 37.376529 | 3.517696 | 1.355373 | 13.59129 | 1.285702 | 606.904967 | 24.819282 | 2074918.0 | 78053.13 |
| std | 0.0 | 0.0 | 45567.880939 | 453.098272 | 0.042056 | 1.319176 | 478.306843 | 0.499681 | 22.600203 | 2.324678 | 1.103816 | 11.03055 | 0.802092 | 365.813444 | 24.261673 | 4027032.0 | 403624.0 |
| min | 2000.0 | 3.0 | 1.0 | 150.0 | 1.0 | 1.0 | 75.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | -17798.0 | -19998.0 |
| 25% | 2000.0 | 3.0 | 39520.25 | 600.0 | 1.0 | 1.0 | 539.0 | 1.0 | 18.0 | 1.0 | 1.0 | 10.0 | 1.0 | 275.0 | 0.0 | 9732.5 | 25201.25 |
| 50% | 2000.0 | 3.0 | 78810.5 | 600.0 | 1.0 | 2.0 | 610.0 | 2.0 | 37.0 | 4.0 | 1.0 | 10.0 | 2.0 | 567.0 | 20.0 | 27125.0 | 47555.0 |
| 75% | 2000.0 | 3.0 | 118361.0 | 600.0 | 1.0 | 3.0 | 665.0 | 2.0 | 53.0 | 6.0 | 3.0 | 30.0 | 2.0 | 999.0 | 52.0 | 79600.0 | 78000.0 |
| max | 2000.0 | 3.0 | 157986.0 | 2500.0 | 5.0 | 18.0 | 5419.0 | 2.0 | 94.0 | 6.0 | 3.0 | 30.0 | 2.0 | 999.0 | 52.0 | 9999999.0 | 9999999.0 |


In [149]:
census_df[census_df["INCTOT"] < 50000] # Get observations for which total personal income was less than $50,000

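|    | YEAR | DATANUM | SERIAL | HHWT | GQ | PERNUM | PERWT | SEX | AGE | MARST | EMPSTAT | EMPSTATD | LABFORCE | OCC1990 | WKSWORK1 | INCTOT | FTOTINC |
|----|------|---------|--------|------|----|--------|-------|-----|-----|-------|---------|----------|----------|---------|----------|--------|---------|
| 0 | 2000 | 3 | 1 | 600 | 1 | 1 | 618 | 2 | 66 | 5 | 1 | 10 | 2 | 95 | 46 | 18000 | 54700 |
| 1 | 2000 | 3 | 1 | 600 | 1 | 2 | 684 | 1 | 40 | 6 | 1 | 10 | 2 | 507 | 50 | 36700 | 54700 |
| 3 | 2000 | 3 | 2 | 600 | 1 | 2 | 609 | 2 | 48 | 1 | 3 | 30 | 1 | 376 | 1 | 900 | 56900 |
| 4 | 2000 | 3 | 2 | 600 | 1 | 3 | 621 | 1 | 19 | 6 | 1 | 10 | 2 | 885 | 16 | 2000 | 56900 |
| 5 | 2000 | 3 | 2 | 600 | 1 | 4 | 559 | 2 | 17 | 6 | 3 | 30 | 1 | 999 | 0 | 0 | 56900 |
| 6 | 2000 | 3 | 3 | 600 | 1 | 1 | 615 | 1 | 47 | 4 | 2 | 20 | 2 | 373 | 24 | 10000 | 10000 |
| 7 | 2000 | 3 | 4 | 600 | 1 | 1 | 635 | 1 | 53 | 3 | 3 | 30 | 1 | 486 | 20 | 46100 | 46100 |
| 8 | 2000 | 3 | 5 | 600 | 1 | 1 | 603 | 1 | 42 | 1 | 3 | 30 | 1 | 65 | 11 | 4000 | 43000 |
| 9 | 2000 | 3 | 5 | 600 | 1 | 2 | 617 | 2 | 46 | 1 | 1 | 10 | 2 | 234 | 52 | 28000 | 43000 |
| 10 | 2000 | 3 | 5 | 600 | 1 | 3 | 589 | 1 | 21 | 6 | 2 | 20 | 2 | 991 | 0 | 0 | 43000 |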


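In this data, the value 9999999 is used to indicate a missing value for personal income (INCTOT). These placeholder values are why the INCTOT mean in the summary above (about 2,074,918) is far above its median (27,125). Let's convert them to NaN so that pandas treats them as missing values instead of incomes of $9,999,999.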

In [151]:
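census_df["INCTOT"] = census_df["INCTOT"].replace(9999999, np.nan) # np.nan is the spelling that works across NumPy versions; the np.NaN alias was removed in NumPy 2.0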
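The effect of the replacement is easiest to see on a handful of values. The sketch below uses four INCTOT values (three taken from the rows shown earlier plus one placeholder) and compares the mean before and after. It also shows an alternative, the `na_values` parameter of `read_csv()`, which marks the placeholder as missing while the file is being read; the in-memory CSV text is a stand-in for the real file.

```python
import io

import numpy as np
import pandas as pd

incomes = pd.Series([18000, 36700, 9999999, 900], name="INCTOT")

cleaned = incomes.replace(9999999, np.nan)
mean_before = incomes.mean()   # inflated by the 9999999 placeholder
mean_after = cleaned.mean()    # NaN is skipped, so only real incomes count

# Alternative: declare the placeholder as missing at read time, per column
csv_text = "INCTOT\n18000\n36700\n9999999\n900\n"
from_csv = pd.read_csv(io.StringIO(csv_text), na_values={"INCTOT": ["9999999"]})

print(mean_before, mean_after)
print(from_csv)
```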

Now you are ready to perform some more detailed analysis on the data! There are many Python libraries available for data analysis that operate on DataFrames and Series. One you can check out is <code>scikit-learn</code>, which you can use to fit linear models between some of the variables and run a regression.
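As a first taste of fitting a linear model, the sketch below fits a straight line with NumPy's `polyfit`, used here as a lightweight stand-in for scikit-learn so that no extra library is needed. It uses the five rows shown by `head()` earlier, which is far too few for a meaningful result; the point is only the mechanics.

```python
import numpy as np
import pandas as pd

# WKSWORK1 and INCTOT for the five rows shown by census_df.head()
sample = pd.DataFrame(
    {
        "WKSWORK1": [46, 50, 52, 1, 16],
        "INCTOT": [18000, 36700, 54000, 900, 2000],
    }
)

# Least-squares fit of INCTOT = slope * WKSWORK1 + intercept
slope, intercept = np.polyfit(sample["WKSWORK1"], sample["INCTOT"], deg=1)

# Predicted income for each row under the fitted line
predicted = slope * sample["WKSWORK1"] + intercept

print(slope, intercept)
print(predicted)
```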