# Lecture 1: Introduction to Pandas

In this interactive lab, you will learn a bit about the powerful open-source library ```pandas```. ```pandas``` is an industry-standard library for working with *structured* data (e.g., tables, dataframes). This is the first of three lectures dedicated to learning ```pandas``` with increasing complexity. The coding goals of today's lecture are to understand:

1. What ```pandas``` is and how to import it
2. Crucial data structures such as Series and DataFrames
3. How to extract data from these data structures

Let's begin!

## Introduction to ```pandas```

```pandas``` is a library for working with data. It supports importing, cleaning, merging, and transforming data. One of the benefits of the library is that it integrates with many other standard Python libraries such as ```numpy``` (for numerical calculations), ```matplotlib``` (for visualization) and ```scikit-learn``` (for data analysis).

First, you will need to import the library. Below is the standard importation for pandas (using the alias pd):

In [17]:
import pandas as pd

## Crucial Data Structures

In ```pandas```, tables are called DataFrames. Each DataFrame is made up of columns of data, and each column is a Series.

### Series

The simplest data structure in ```pandas``` is a Series, which is essentially a column in a table (or a 1D array) that holds data. You can create a Series from a list of data or from other inputs such as a dictionary:

In [18]:
my_list = [4, 2, 25]
my_Series = pd.Series(my_list)
print(my_Series)

0     4
1     2
2    25
dtype: int64


The left column shows the index of the Series (its labels), while the right column contains the data itself. You can use an index label to print an element of the Series:

In [19]:
print(my_Series[2])

25


However, your labels do not need to be the default 0, 1, 2. You can label your Series for clarity:

In [20]:
my_labeled_Series = pd.Series(my_list, index=["Year in College", "# of majors", "Age"])
my_labeled_Series

Year in College     4
# of majors         2
Age                25
dtype: int64

Now you can access the elements of your Series using their labels instead of an index number. How can you access the third row of the Series?
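
One possible answer, as a minimal sketch: the Series is rebuilt here so the cell runs on its own, and the variable names `age_by_label` and `age_by_position` are introduced only for illustration. A label goes directly in square brackets, while ```iloc``` selects by integer position regardless of the labels.

```python
import pandas as pd

# Rebuild the labeled Series from the example above.
my_labeled_Series = pd.Series(
    [4, 2, 25], index=["Year in College", "# of majors", "Age"]
)

# Access by label: pass the label itself in square brackets.
age_by_label = my_labeled_Series["Age"]

# Access by integer position: iloc counts rows from 0, whatever the labels are.
age_by_position = my_labeled_Series.iloc[2]

print(age_by_label, age_by_position)
```

Both lines pick out the same element, the third row of the Series.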

Although the examples above use numbers (integers) in the Series, pretty much any data type can be used. Let's create a Series called DSC100_Names, which contains the names of everyone in the class:

In [21]:
DSC100_Names = pd.Series(["anna", "bob", "Charlie", "duncan"])

Series can also be created from dictionaries:

In [22]:
distance_ran = {"Week1": 8.7, "Week2": 10.2, "Week3": 11.1}
running_Series = pd.Series(distance_ran)
running_Series

Week1     8.7
Week2    10.2
Week3    11.1
dtype: float64

You can also create a Series using only specified data points. For example, if we want a Series that only has the miles run in Week 1 and Week 3, we can do the following:

In [23]:
truncated_running_Series = pd.Series(distance_ran, index=["Week1","Week3"])
truncated_running_Series

Week1     8.7
Week3    11.1
dtype: float64
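
A natural follow-up is what happens when the requested index contains a label that is not a key in the dictionary. The sketch below uses a hypothetical label, `"Week4"`, that does not appear in `distance_ran`; ```pandas``` keeps the label and fills its value with a missing-value marker (NaN).

```python
import pandas as pd

distance_ran = {"Week1": 8.7, "Week2": 10.2, "Week3": 11.1}

# "Week4" is a hypothetical label that is not a key in the dictionary.
partial_Series = pd.Series(distance_ran, index=["Week1", "Week4"])
print(partial_Series)

# isna() flags the entries that had no matching key.
print(partial_Series.isna())
```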

### DataFrames

Essentially, DataFrames are tables of data. From a coding perspective in ```pandas```, they can be thought of as a collection of Series put together, all of which share the same index.

Let's begin by creating a Series called DSC100_Years, which records what year each person in the DSC 100 class is in.

In [24]:
DSC100_Years = pd.Series([2, 2, 2, 4])

#### Building DataFrame from Series

Now, if we put the names and years together into a table, we get a DataFrame:

In [25]:
DSC100_df = pd.DataFrame([DSC100_Names,DSC100_Years])

In [26]:
DSC100_df

,0,1,2,3
0,anna,bob,Charlie,duncan
1,2,2,2,4


Hmmm... Maybe that's not exactly what we want. Passing a list of Series made each Series a row, but we want each row to represent a student. How can we do this?
- Option 1: concatenate our series
- Option 2: create a dictionary from our series 

In [30]:
pd.concat([DSC100_Names,DSC100_Years],axis=1)

,0,1
0,anna,2
1,bob,2
2,Charlie,2
3,duncan,4
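
The concatenated table above has columns labeled 0 and 1. One way to give them names at concatenation time is the `keys` argument of ```pd.concat```. The sketch below rebuilds the two Series so it runs on its own; the column names `"Names"` and `"Years"` and the variable `named_df` are chosen here for illustration.

```python
import pandas as pd

DSC100_Names = pd.Series(["anna", "bob", "Charlie", "duncan"])
DSC100_Years = pd.Series([2, 2, 2, 4])

# axis=1 places the Series side by side as columns;
# keys supplies one column name per Series, in order.
named_df = pd.concat(
    [DSC100_Names, DSC100_Years], axis=1, keys=["Names", "Years"]
)
print(named_df)
```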


#### Creating DataFrame from Dictionary

We can first create a dictionary that has the keys Names and Years, then create the DataFrame from that.

In [28]:
DSC100_df_option2 = pd.DataFrame({"Names":DSC100_Names,"Years":DSC100_Years})
DSC100_df_option2

,Names,Years
0,anna,2
1,bob,2
2,Charlie,2
3,duncan,4
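
To see the "collection of Series" idea from the other direction, the sketch below pulls one column back out of a DataFrame and checks that it is a Series sharing the table's index. The DataFrame is rebuilt from plain lists so the cell is self-contained; `names_column` is a name introduced here for illustration.

```python
import pandas as pd

DSC100_df = pd.DataFrame({
    "Names": ["anna", "bob", "Charlie", "duncan"],
    "Years": [2, 2, 2, 4],
})

# Selecting a single column by its name returns a Series.
names_column = DSC100_df["Names"]
print(type(names_column))

# The column carries the same index as the DataFrame it came from.
print(names_column.index.equals(DSC100_df.index))
```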


#### Reading in a DataFrame from a csv 

```pandas``` can read CSV files and automatically put them into DataFrame form. Below, we read in the CSV file 'Advertising.csv' as a DataFrame.

The Advertising CSV contains 4 columns of information: $\$$ spent on TV advertising, $\$$ spent on radio advertising, $\$$ spent on newspaper advertising, and $\$$ in sales.

In [33]:
Advertising_df = pd.read_csv('Advertising.csv')
Advertising_df

,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,9.7
197,177.0,9.3,6.4,12.8
198,283.6,42.0,66.2,25.5


Let's show another example: read in Auto.csv, which contains information about different cars.

In [34]:
Car_df = pd.read_csv('Auto.csv')
Car_df

,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
392,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
393,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
394,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
395,28.0,4,120.0,79,2625,18.6,82,1,ford ranger


It probably makes more sense to have the last column (name) be the index (instead of numbers 0, 1, 2, etc.). We can do that in ```pandas``` too! In fact, we can do it in two different ways:

1. use the ```set_index``` method after reading the file (```Car_df.set_index("column name")```)
2. specify which column should be the index column when reading in the CSV file

We'll look at the second option below.

In [37]:
Car_df = pd.read_csv('Auto.csv',index_col="name")
Car_df

name,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
chevrolet chevelle malibu,18.0,8,307.0,130,3504,12.0,70,1
buick skylark 320,15.0,8,350.0,165,3693,11.5,70,1
plymouth satellite,18.0,8,318.0,150,3436,11.0,70,1
amc rebel sst,16.0,8,304.0,150,3433,12.0,70,1
ford torino,17.0,8,302.0,140,3449,10.5,70,1
...,...,...,...,...,...,...,...,...
ford mustang gl,27.0,4,140.0,86,2790,15.6,82,1
vw pickup,44.0,4,97.0,52,2130,24.6,82,2
dodge rampage,32.0,4,135.0,84,2295,11.6,82,1
ford ranger,28.0,4,120.0,79,2625,18.6,82,1


Just like with Series, we can also use indexing to access information. Note that the index does not have to be unique! Try it: you can create a car DataFrame using year as the index (which is not unique). If you later regret this, you can always use ```reset_index()``` to move the index back into a regular column.

However, **columns** should always have unique names.
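
The two ways of setting an index listed above, and the non-unique index idea, can be sketched on a tiny in-memory CSV. The three rows below are made-up stand-ins (not taken from Auto.csv), and `io.StringIO` is used only so the example runs without a file on disk.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for a file on disk (hypothetical values).
csv_text = """mpg,year,name
18.0,70,car a
15.0,70,car b
27.0,82,car c
"""

# Option 1: read first, then move a column into the index.
# set_index returns a new DataFrame; cars itself is left unchanged.
cars = pd.read_csv(io.StringIO(csv_text))
cars_by_name = cars.set_index("name")

# Option 2: choose the index column while reading.
cars_by_name_2 = pd.read_csv(io.StringIO(csv_text), index_col="name")

# A non-unique index is allowed: two cars share the year 70.
cars_by_year = cars.set_index("year")
print(cars_by_year.loc[70])  # both rows labeled 70

# reset_index moves the index back into an ordinary column.
restored = cars_by_year.reset_index()
print(restored.columns)
```

Note that after ```reset_index()``` the former index becomes the first column, so the column order can differ from the table you started with.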

#### Information about DataFrames

You may want to know the names of the rows/columns of a DataFrame, or its shape. The shape tells you how many data points you have (number of rows) and how many columns.

In [39]:
Car_df.index

Index(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite',
       'amc rebel sst', 'ford torino', 'ford galaxie 500', 'chevrolet impala',
       'plymouth fury iii', 'pontiac catalina', 'amc ambassador dpl',
       ...
       'chrysler lebaron medallion', 'ford granada l', 'toyota celica gt',
       'dodge charger 2.2', 'chevrolet camaro', 'ford mustang gl', 'vw pickup',
       'dodge rampage', 'ford ranger', 'chevy s-10'],
      dtype='object', name='name', length=397)

In [40]:
Car_df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin'],
      dtype='object')

In [41]:
Car_df.shape

(397, 8)
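
Because the shape is a plain tuple of (rows, columns), it can be unpacked into two variables. A minimal sketch on a small made-up DataFrame; the row labels and the names `small_df`, `n_rows`, and `n_cols` are hypothetical:

```python
import pandas as pd

small_df = pd.DataFrame(
    {"cylinders": [8, 8, 4], "year": [70, 70, 82]},
    index=["car a", "car b", "car c"],  # hypothetical row labels
)

# shape is (number of rows, number of columns).
n_rows, n_cols = small_df.shape
print(n_rows, n_cols)

# len() of a DataFrame gives the number of rows.
print(len(small_df))
```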

## Extracting Data from DataFrames

Often we want to extract information from DataFrames so that we can analyze it. There are many different ways to extract information (e.g., by label or by position). We'll look at each case.

### Label-Based Extraction

We can extract information using the ```loc``` indexer. With ```loc```, you specify the *labels* of the row and column you want. For example, to extract the 10th entry in the sales column of the Advertising DataFrame (row label 9, since the labels start at 0):

In [55]:
Advertising_df.loc[9,"sales"]

10.6

You can also extract several rows and columns at once with ```loc``` by passing lists of the labels you want. For example, if we want the number of cylinders, the year, and the mpg for the plymouth fury iii and the ford gran torino, we can do the following:

In [58]:
Car_df.loc[["plymouth fury iii","ford gran torino"],["cylinders","year","mpg"]]

name,cylinders,year,mpg
plymouth fury iii,8,70,14.0
plymouth fury iii,8,71,14.0
plymouth fury iii,8,72,15.0
ford gran torino,8,73,14.0
ford gran torino,8,74,16.0
ford gran torino,8,76,14.5


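You can also use slicing to examine multiple columns or rows of data. Let's look at the columns up to and including newspaper (so not including sales); unlike ordinary Python slices, label slices with ```loc``` include the end label. For the rows, we pass a list of three row labels (1, 25, and 184):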

In [61]:
Advertising_df.loc[[1,25,184],:"newspaper"]

,TV,radio,newspaper
1,44.5,39.3,45.1
25,262.9,3.5,19.5
184,253.8,21.3,30.0


We can also extract all rows or all columns of the dataset. For example, we can extract the sales column of the Advertising dataset:

In [None]:
Advertising_df.loc[:,"sales"]

### Integer-Based Extraction

If, instead, you want to extract data based on integer location, you can use ```iloc```. In this case, we specify the integer positions (instead of the labels) that we want to extract. For example, with the car dataset:

In [62]:
Car_df.iloc[1,3]

'165'

We can also use lists:

In [63]:
Car_df.iloc[[0,1,3],[1,2,5]]

name,cylinders,displacement,acceleration
chevrolet chevelle malibu,8,307.0,12.0
buick skylark 320,8,350.0,11.5
amc rebel sst,8,304.0,12.0


...and slices

In [65]:
Car_df.iloc[5:10,0:4]

name,mpg,cylinders,displacement,horsepower
ford galaxie 500,15.0,8,429.0,198
chevrolet impala,14.0,8,454.0,220
plymouth fury iii,14.0,8,440.0,215
pontiac catalina,14.0,8,455.0,225
amc ambassador dpl,15.0,8,390.0,190
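
One difference between the two kinds of slices is worth checking by hand: a label slice with ```loc``` includes its end label, while an integer slice with ```iloc``` stops just before its end position. The sketch below uses the first four rows of the Advertising table shown earlier, typed in directly so no file is needed; `ads`, `by_label`, and `by_position` are names introduced here for illustration.

```python
import pandas as pd

# First four rows of the Advertising table shown earlier, typed in by hand.
ads = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5],
    "radio": [37.8, 39.3, 45.9, 41.3],
    "newspaper": [69.2, 45.1, 69.3, 58.5],
    "sales": [22.1, 10.4, 9.3, 18.5],
})

# Label slices include the end label: rows 1, 2, 3 and columns TV through newspaper.
by_label = ads.loc[1:3, "TV":"newspaper"]

# Integer slices stop before the end position: rows 1, 2 and columns 0, 1.
by_position = ads.iloc[1:3, 0:2]

print(by_label.shape)     # (3, 3)
print(by_position.shape)  # (2, 2)
```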


...or entire rows or columns. To extract the mpg column, we note that it is at integer position 0:

In [66]:
Car_df.iloc[:,0]

name
chevrolet chevelle malibu    18.0
buick skylark 320            15.0
plymouth satellite           18.0
amc rebel sst                16.0
ford torino                  17.0
                             ... 
ford mustang gl              27.0
vw pickup                    44.0
dodge rampage                32.0
ford ranger                  28.0
chevy s-10                   31.0
Name: mpg, Length: 397, dtype: float64

## Putting it all together!

### Question 1. 

Consider the Advertising dataset. A natural question is whether advertising in media such as TV, newspaper, and radio has any effect on company sales. Select **one** medium (either TV, newspaper, or radio) and create a scatter plot with the money spent on advertising on the x-axis and the company sales on the y-axis. To do this, use the following steps:

1. extract your selected media column and name it x
2. extract the sales column and name it y
3. using matplotlib.pyplot, create the scatter plot

Don't forget to label your axes!

In [None]:
import matplotlib.pyplot as plt



Looking at your created plot, do you think there is a strong or weak relationship between advertising and sales? Is it a positive or negative relationship? Why do you think that is?

### Question 2.

Consider the Auto dataset. To remind yourself of the variables, print out the column names.

Looking at your column names, which two columns do you hypothesize might be related to one another? Why do you hypothesize that a relationship exists between those two variables? What do you anticipate the relationship to look like?

To test your hypothesis,

1. extract the column that you expect to be independent and name it x
2. extract the column that is dependent and name it y
3. using matplotlib.pyplot, create the scatter plot

Don't forget to label your axes!

Looking at your created plot, do you think there is a strong or weak relationship between your variables? Is it a positive or negative relationship? Why do you think that is?