# Lecture 6: DataFrame

* Pandas is an open source Python library for data analysis. It is a powerful toolkit for reading, filtering, manipulating, and exporting data.
* Since Pandas is not part of the Python standard library, you have to first tell Python to load the library.
* When working with Pandas functions, it is common practice to give pandas the alias pd.

In [1]:
import pandas as pd

## 1. Series

* The Series is a one-dimensional container, similar to the built-in Python list. It is the data type that represents each column of the DataFrame.
* The easiest way to create a Series is to pass in a Python list. If we pass in a list of mixed types, pandas uses the most general type that can represent all of the values. Typically this dtype is object.

In [2]:
s = pd.Series(['apple', 10])
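A minimal sketch of what this cell produces when the Series is printed. The variable name `s` comes from the lecture; the `print` calls are added here for illustration.

```python
import pandas as pd

# A list with mixed types (a string and an integer)
s = pd.Series(['apple', 10])

print(s)        # the default integer index (0, 1) is shown on the left of the values
print(s.dtype)  # object, because no single numeric type fits both values
```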


* Notice on the left that the "row number" is shown. This is actually the **index** for the series. It is similar to the row name and row index for dataframes. It implies that we can actually assign a "name" to values in our series.

In [3]:
covid = pd.Series([79293924, 42631421, 27425743, 21622265, 18266015], index=['USA', 'India', 'Brazil', 'France', 'UK'])
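Once a Series has index labels, values can be looked up by label. A minimal sketch using the `covid` Series from the lecture; the names `india` and `subset` are illustrative.

```python
import pandas as pd

covid = pd.Series(
    [79293924, 42631421, 27425743, 21622265, 18266015],
    index=['USA', 'India', 'Brazil', 'France', 'UK'],
)

# Look up a single value by its index label
india = covid['India']

# .loc also selects by label and accepts a list of labels
subset = covid.loc[['USA', 'UK']]

print(india)
print(subset)
```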


* There are many attributes and methods associated with a **Series** object. Two examples of attributes are **index** and **values**. 

In [5]:
covid.index

Index(['USA', 'India', 'Brazil', 'France', 'UK'], dtype='object')

In [7]:
covid.values

array([79293924, 42631421, 27425743, 21622265, 18266015])

* You can get summary statistics of the series with .describe().

In [4]:
covid.describe()

count    5.000000e+00
mean     3.784787e+07
std      2.497998e+07
min      1.826602e+07
25%      2.162226e+07
50%      2.742574e+07
75%      4.263142e+07
max      7.929392e+07
dtype: float64

* The result of describe() is itself a Series, so a single statistic can be extracted from it. The first entry is the count and the second is the mean.

In [11]:
covid.describe().iloc[0]

5.0
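Boolean subsetting keeps only the values for which a condition is True. The sketch below, using the `covid` Series from the lecture, keeps the countries whose count is above the mean; the names `mean_cases` and `above_mean` are illustrative.

```python
import pandas as pd

covid = pd.Series(
    [79293924, 42631421, 27425743, 21622265, 18266015],
    index=['USA', 'India', 'Brazil', 'France', 'UK'],
)

mean_cases = covid.mean()

# covid > mean_cases is a Series of True/False values, one per country;
# passing it in square brackets keeps only the rows marked True
above_mean = covid[covid > mean_cases]

print(above_mean)
```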

## 2. DataFrame

* A DataFrame can be created from a dictionary of **key** - **values** pairs.
* Each **key** is a column name, and its **values** are the contents of that column.

In [12]:
df = pd.DataFrame({'country': ['USA', 'India', 'Brazil', 'France', 'UK'], 'confirmed': [79293924, 42631421, 27425743, 21622265, 18266015],
                          'population': [333214298, 1387953387, 214341966, 67813000, 67081234]})

* Every DataFrame object has a shape attribute that will give us the number of rows and columns of the DataFrame.

In [13]:
df.shape

(5, 3)

* To see which variables the DataFrame contains, we look at its columns attribute.

In [14]:
df.columns

Index(['country', 'confirmed', 'population'], dtype='object')

* You can check the data types of each column by using the dtypes attribute.

In [15]:
df.dtypes

country       object
confirmed      int64
population     int64
dtype: object

### Loading dataset

* With the pandas library loaded, we can use the read_csv function to load a CSV data file.
* You can also load other types of data such as JSON, HTML, Excel, and SAS. See https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

In [16]:
mlb_winloss = pd.read_csv('./mlb_winloss.csv')

######## For Colab users ########
#import io
#from google.colab import files
#uploaded = files.upload()
#mlb_winloss = pd.read_csv(io.StringIO(uploaded['mlb_winloss.csv'].decode('utf-8')))

* A DataFrame is similar to a worksheet in an Excel workbook: a table of rows and columns.

In [17]:
mlb_winloss.head()

Unnamed: 0,Rank,Team,Won,Lost,Tied,First MLB Season,Division
0,1,New York Yankees,10621,8090,93,1901,AL East
1,2,San Francisco Giants,11301,9773,163,1883,NL West
2,3,Los Angeles Dodgers,11123,9891,139,1884,NL West
3,4,St. Louis Cardinals,11038,10163,152,1882,NL Central
4,5,Boston Red Sox,9718,9014,83,1901,AL East


### Subsetting columns and rows

* Today's data often has too many cells to make sense of all the printed information. Instead, the best way to look at our data is to inspect it in parts by looking at various subsets of the data.
* We already saw that we can use the **head** method of a dataframe to look at the first five rows of our data. This is useful to see if our data loaded properly and to get a sense of each of the columns, its name, and its contents.
* Sometimes, however, we may want to see only particular rows, columns, or values from our data.

* If we want only a specific column from our data, we can access the data using square brackets.

In [18]:
mlb_winloss['Team']

0          New York Yankees
1      San Francisco Giants
2       Los Angeles Dodgers
3       St. Louis Cardinals
4            Boston Red Sox
5              Chicago Cubs
6       Cleveland Guardians
7           Cincinnati Reds
8            Detroit Tigers
9         Chicago White Sox
10       Pittsburgh Pirates
11           Atlanta Braves
12       Los Angeles Angels
13           Houston Astros
14        Toronto Blue Jays
15        Oakland Athletics
16     Washington Nationals
17     Arizona Diamondbacks
18        Milwaukee Brewers
19           Tampa Bay Rays
20          Minnesota Twins
21            New York Mets
22       Kansas City Royals
23            Texas Rangers
24        Baltimore Orioles
25         Seattle Mariners
26    Philadelphia Phillies
27         Colorado Rockies
28         San Diego Padres
29            Miami Marlins
Name: Team, dtype: object

* To select multiple columns by name, we need to pass in a list of column names between the square brackets.

In [20]:
mlb_winloss[['Team', 'Won']]

Unnamed: 0,Team,Won
0,New York Yankees,10621
1,San Francisco Giants,11301
2,Los Angeles Dodgers,11123
3,St. Louis Cardinals,11038
4,Boston Red Sox,9718
5,Chicago Cubs,11087
6,Cleveland Guardians,9592
7,Cincinnati Reds,10713
8,Detroit Tigers,9446
9,Chicago White Sox,9411


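* We can use the loc attribute on the dataframe to subset rows based on the index label. loc also accepts a boolean condition, which is what we use here to keep the rows where Team equals 'Miami Marlins'.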

In [21]:
mlb_winloss.loc[mlb_winloss['Team']=='Miami Marlins']

Unnamed: 0,Rank,Team,Won,Lost,Tied,First MLB Season,Division
29,30,Miami Marlins,2088,2438,0,1993,NL East


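* iloc is used to subset by integer position instead of by label. Here we select the value in the row at position 2 and the column at position 3 (Lost). Note that the result is a string, not a number.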

In [24]:
mlb_winloss.iloc[2, 3]

'9,891'
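The difference between loc and iloc is easiest to see when the index labels are not the default 0, 1, 2. A minimal sketch, assuming a small hypothetical table that uses the team names as index labels (the lecture's file uses the default integer index instead):

```python
import pandas as pd

# Hypothetical three-row table; Rank and Division values are taken from the output above
teams = pd.DataFrame(
    {'Rank': [1, 2, 3], 'Division': ['AL East', 'NL West', 'NL West']},
    index=['New York Yankees', 'San Francisco Giants', 'Los Angeles Dodgers'],
)

# loc: select by row label and column label
by_label = teams.loc['Los Angeles Dodgers', 'Division']

# iloc: select by row position and column position (both counted from 0)
by_position = teams.iloc[2, 1]

print(by_label, by_position)  # both refer to the same cell
```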

* If we want only the second and third columns, either of the following iloc and loc expressions works:

In [28]:
mlb_winloss.iloc[:, [1,2]]
mlb_winloss.loc[:, [mlb_winloss.columns[1],mlb_winloss.columns[2]]]


Unnamed: 0,Team,Won
0,New York Yankees,10621
1,San Francisco Giants,11301
2,Los Angeles Dodgers,11123
3,St. Louis Cardinals,11038
4,Boston Red Sox,9718
5,Chicago Cubs,11087
6,Cleveland Guardians,9592
7,Cincinnati Reds,10713
8,Detroit Tigers,9446
9,Chicago White Sox,9411


* You can also subset columns by passing a range of positions.

In [37]:
mlb_winloss.iloc[:, range(5)]

Unnamed: 0,Rank,Team,Won,Lost,Tied
0,1,New York Yankees,10621,8090,93
1,2,San Francisco Giants,11301,9773,163
2,3,Los Angeles Dodgers,11123,9891,139
3,4,St. Louis Cardinals,11038,10163,152
4,5,Boston Red Sox,9718,9014,83
5,6,Chicago Cubs,11087,10521,161
6,7,Cleveland Guardians,9592,9144,91
7,8,Cincinnati Reds,10713,10501,139
8,9,Detroit Tigers,9446,9311,93
9,10,Chicago White Sox,9411,9309,103


* As with a **Series**, we can subset a dataframe with a boolean condition.
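A minimal sketch of boolean subsetting on a DataFrame, using the small `df` built earlier in this lecture. The thresholds and the names `large` and `both` are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'India', 'Brazil', 'France', 'UK'],
    'confirmed': [79293924, 42631421, 27425743, 21622265, 18266015],
    'population': [333214298, 1387953387, 214341966, 67813000, 67081234],
})

# Keep the rows where the condition is True
large = df[df['population'] > 100_000_000]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
both = df[(df['population'] > 100_000_000) & (df['confirmed'] > 40_000_000)]

print(large)
print(both)
```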

* You can insert a new column in the dataframe.

In [39]:
mlb_winloss['newcol'] = 0
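A new column does not have to be a constant; it can also be computed from existing columns. A minimal sketch using the small `df` from earlier in this lecture, where the column name `cases_per_100k` is a hypothetical choice:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'India', 'Brazil', 'France', 'UK'],
    'confirmed': [79293924, 42631421, 27425743, 21622265, 18266015],
    'population': [333214298, 1387953387, 214341966, 67813000, 67081234],
})

# Arithmetic on columns is applied row by row
df['cases_per_100k'] = df['confirmed'] / df['population'] * 100_000

print(df.shape)  # one more column than before
print(df[['country', 'cases_per_100k']])
```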

### Describe your data

* describe() shows basic summary statistics of a dataframe, such as the count, mean, standard deviation, and percentiles of each column.

In [40]:
mlb_winloss.describe()

Unnamed: 0,Rank,Tied,First MLB Season,newcol
count,30.0,30.0,30.0,30.0
mean,15.5,65.7,1930.466667,0.0
std,8.803408,63.2718,44.589107,0.0
min,1.0,0.0,1876.0,0.0
25%,8.25,3.0,1888.25,0.0
50%,15.5,85.0,1901.0,0.0
75%,22.75,113.75,1969.0,0.0
max,30.0,163.0,1998.0,0.0


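* Where are Won, Lost, Total Games? By default, describe() summarizes only numeric columns. The value '9,891' returned by iloc earlier is a string, which suggests that columns such as Lost were read as text because of their thousands separators, so they are left out of the summary.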
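A minimal sketch of one possible fix: the `thousands` parameter of read_csv. The two-row CSV text below is a hypothetical extract that assumes the file stores counts as quoted text like "9,891"; the actual layout of mlb_winloss.csv may differ.

```python
import io
import pandas as pd

# Hypothetical extract; numbers with thousands separators must be quoted in a CSV
csv_text = '''Team,Won,Lost
New York Yankees,"10,621","8,090"
Los Angeles Dodgers,"11,123","9,891"
'''

# Without extra options, Won and Lost are read as text
raw = pd.read_csv(io.StringIO(csv_text))
print(raw['Won'].dtype)

# thousands=',' tells pandas to treat the comma as a digit-group separator
parsed = pd.read_csv(io.StringIO(csv_text), thousands=',')
print(parsed.dtypes)
print(parsed.describe())  # Won and Lost now appear in the summary
```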

## Exercise

1. Load (read) the automobile data (automobile_data.csv).

In [41]:
# your code here

dataset = pd.read_csv('./automobile_data.csv')

2. Print the first five rows.

In [42]:
# your code here

dataset.head()

Unnamed: 0,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
0,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,13495.0
1,alfa-romero,convertible,88.6,168.8,dohc,four,111,21,16500.0
2,alfa-romero,hatchback,94.5,171.2,ohcv,six,154,19,16500.0
3,audi,sedan,99.8,176.6,ohc,four,102,24,13950.0
4,audi,sedan,99.4,176.6,ohc,five,115,18,17450.0


3. Print the last ten rows.

In [43]:
# your code here

# tail(10) returns the last ten rows; head(10) would return the first ten
dataset.tail(10)


4. How many rows and columns does the DataFrame have?

In [44]:
# your code here

dataset.shape

(61, 9)

5. Print the 10th and 20th rows of the DataFrame.

In [45]:
# your code here

dataset.iloc[[9,19]]

Unnamed: 0,company,body-style,wheel-base,length,engine-type,num-of-cylinders,horsepower,average-mileage,price
9,bmw,sedan,101.2,176.8,ohc,six,121,21,20970.0
19,honda,sedan,96.5,175.4,ohc,four,101,24,12945.0


6. Print the columns (from the 5th to the 7th one).

In [46]:
# your code here

# The slice 4:7 selects positions 4, 5 and 6, i.e. the 5th through 7th columns
dataset.iloc[:, 4:7]

Unnamed: 0,engine-type,num-of-cylinders,horsepower
0,dohc,four,111
1,dohc,four,111
2,ohcv,six,154
3,ohc,four,102
4,ohc,five,115
...,...,...,...


7. How many automobiles have horsepower greater than 100?

In [52]:
# your code here

print(dataset[dataset['horsepower']>100].shape[0])


29

8. How many automobiles are cheaper than 6000?

In [54]:
# your code here

print(dataset[dataset['price']<6000].shape[0])

4


## References
* Chen, D. Y. (2017). Pandas for everyone: Python data analysis. Addison-Wesley Professional.
* Data Analysis with Python: https://www.coursera.org/learn/data-analysis-with-python
* https://pandas.pydata.org/