# IM939 Lab 2 - Part 1

Last session we loaded data using [Pandas](https://pandas.pydata.org/). Here we explore how to use Pandas to read in, process and explore data.

As before we load pandas and use the read_csv method.

In [43]:
import pandas as pd

df = pd.read_csv('office_ratings.csv', encoding='UTF-8')

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   season       188 non-null    int64  
 1   episode      188 non-null    int64  
 2   title        188 non-null    object 
 3   imdb_rating  188 non-null    float64
 4   total_votes  188 non-null    int64  
 5   air_date     188 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 8.9+ KB


## Help!

Python objects carry built-in documentation. In Jupyter (IPython), add a ? before an object or method to display it.

For example, our dataframe:

In [47]:
?df

The dtypes property (properties of an object are values associated with the object and are not called with () at the end).

In [48]:
?df.dtypes
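
The difference between properties and methods can be seen on a small frame. This is a minimal sketch; `demo` is a made-up two-row DataFrame, not the lab data.

```python
import pandas as pd

# Hypothetical stand-in for the ratings data.
demo = pd.DataFrame({"season": [1, 1], "imdb_rating": [7.6, 8.3]})

column_types = demo.dtypes    # property: accessed without parentheses
n_rows, n_cols = demo.shape   # property: a (rows, columns) tuple
averages = demo.mean()        # method: must be called with parentheses

print(column_types)
print(n_rows, n_cols)
print(averages)
```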

The info method for dataframes.

In [49]:
?df.info

The output below is quite long, but it does give you the various arguments (options) you can use with the function.

In [50]:
?pd.read_csv
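
The `?` prefix is a feature of IPython and Jupyter rather than of Python itself. In a plain Python script or console, the built-in `help` function and the `__doc__` attribute give access to the same docstrings. A minimal sketch:

```python
import pandas as pd

# help() prints the full docstring and works in any Python interpreter.
# It is commented out here because the output is very long.
# help(pd.read_csv)

# The docstring is also available as an ordinary string.
doc = pd.read_csv.__doc__
first_line = doc.strip().splitlines()[0]
print(first_line)
```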

The Pandas documentation is rather good. Relevant to our work below are:

* [What kind of data does pandas handle?](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/01_table_oriented.html#min-tut-01-tableoriented)
* [How to calculate summary statistics?](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html#min-tut-06-stats)
* [How to create plots in pandas?](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/04_plotting.html#min-tut-04-plotting)
* [How to handle time series data with ease?](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/09_timeseries.html#min-tut-09-timeseries)

I also found [a rather nice series of lessons a kind person put together](https://bitbucket.org/hrojas/learn-pandas/src/master/). There are lots of online tutorials which will help you.

## Structure

How do we find out the structure of our data?

Well, the variable df is now a pandas DataFrame object.

In [60]:
type(df)

pandas.core.frame.DataFrame

The DataFrame object has lots of built in methods and attributes.

The info method gives us information about datatypes, dimensions and the presence of null values in our dataframe.

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   season       188 non-null    int64  
 1   episode      188 non-null    int64  
 2   title        188 non-null    object 
 3   imdb_rating  188 non-null    float64
 4   total_votes  188 non-null    int64  
 5   air_date     188 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 8.9+ KB


We can just look at the datatypes if we want.

In [40]:
df.dtypes

season           int64
episode          int64
title           object
imdb_rating    float64
total_votes      int64
air_date        object
dtype: object

Or just the dimensions.

In [51]:
df.shape

(188, 6)

In this case, there are only 188 rows. But for larger datasets we might want to look at the head (top 5) and tail (bottom 5) rows.

In [52]:
df.head()

Unnamed: 0,season,episode,title,imdb_rating,total_votes,air_date
0,1,1,Pilot,7.6,3706,2005-03-24
1,1,2,Diversity Day,8.3,3566,2005-03-29
2,1,3,Health Care,7.9,2983,2005-04-05
3,1,4,The Alliance,8.1,2886,2005-04-12
4,1,5,Basketball,8.4,3179,2005-04-19


In [53]:
df.tail()

Unnamed: 0,season,episode,title,imdb_rating,total_votes,air_date
183,9,19,Stairmageddon,8.0,1484,2013-04-11
184,9,20,Paper Airplane,8.0,1482,2013-04-25
185,9,21,Livin' the Dream,8.9,2041,2013-05-02
186,9,22,A.A.R.M.,9.3,2860,2013-05-09
187,9,23,Finale,9.7,7934,2013-05-16
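
Both `head` and `tail` accept an optional number of rows, so you are not limited to five. A minimal sketch on a made-up frame (`demo` is illustrative, not the lab data):

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the ratings data.
demo = pd.DataFrame({"episode": range(1, 11)})

first_three = demo.head(3)  # first 3 rows
last_two = demo.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```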


We may need to put together a dataset.

Consider the two dataframes below.

In [54]:
df_1 = pd.read_csv('office1.csv', encoding='UTF-8')
df_2 = pd.read_csv('office2.csv', encoding='UTF-8')

In [55]:
df_1.head()

Unnamed: 0,id,season,episode,imdb_rating
0,5-1,5,1,8.8
1,9-13,9,13,7.7
2,5-6,5,6,8.5
3,3-23,3,23,9.3
4,9-16,9,16,8.2


In [56]:
df_2.head()

Unnamed: 0,id,total_votes
0,4-10,2095
1,3-21,2403
2,7-24,2040
3,6-18,1769
4,8-8,1584


The total votes and imdb ratings data are split between files. There is a common column called id. We can join the two dataframes together using the common column.

In [57]:
inner_join_office_df = pd.merge(df_1, df_2, on='id', how='inner')
inner_join_office_df

Unnamed: 0,id,season,episode,imdb_rating,total_votes
0,5-1,5,1,8.8,2501
1,9-13,9,13,7.7,1394
2,5-6,5,6,8.5,2018
3,3-23,3,23,9.3,3010
4,9-16,9,16,8.2,1572
...,...,...,...,...,...
183,5-21,5,21,8.7,2032
184,2-13,2,13,8.3,2363
185,9-6,9,6,7.8,1455
186,2-2,2,2,8.2,2736


In this way you can combine datasets using common columns. We will leave that for the moment. If you want more information about merging data then see [this page](https://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/#:~:text=Merge%20%28%29%20Function%20in%20pandas%20is%20similar%20to,rows%20from%20both%20data%20frames%2C%20specify%20how%3D%20%E2%80%98outer%E2%80%99.) and the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).
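
An inner join keeps only the ids that appear in both frames. As a sketch of how the `how` argument changes the result, the example below uses two tiny made-up frames (`left` and `right` are illustrative names) and the `indicator` option of `pd.merge`, which records where each row came from.

```python
import pandas as pd

# Hypothetical frames: id "1-1" is only in left, id "1-4" is only in right.
left = pd.DataFrame({"id": ["1-1", "1-2", "1-3"], "imdb_rating": [7.6, 8.3, 7.9]})
right = pd.DataFrame({"id": ["1-2", "1-3", "1-4"], "total_votes": [3566, 2983, 2886]})

# Inner join: only ids present in both frames survive.
inner = pd.merge(left, right, on="id", how="inner")

# Outer join: every id is kept and missing values become NaN.
# indicator=True adds a _merge column naming the source of each row.
outer = pd.merge(left, right, on="id", how="outer", indicator=True)

print(inner)
print(outer)
```

Comparing the row counts before and after a merge is a quick way to spot ids that failed to match.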

## Summary

To get an overview of our data we can ask Python to 'describe our data'

In [61]:
df.describe()

Unnamed: 0,season,episode,imdb_rating,total_votes
count,188.0,188.0,188.0,188.0
mean,5.468085,11.87766,8.257447,2126.648936
std,2.386245,7.024855,0.538067,787.098275
min,1.0,1.0,6.7,1393.0
25%,3.0,6.0,7.9,1631.5
50%,6.0,11.5,8.2,1952.5
75%,7.25,18.0,8.6,2379.0
max,9.0,26.0,9.7,7934.0


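or we can pull out specific statistics. Recent versions of pandas need numeric_only=True here, because the title and air_date columns hold text and cannot be averaged.

In [62]:
df.mean(numeric_only=True)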

season            5.468085
episode          11.877660
imdb_rating       8.257447
total_votes    2126.648936
dtype: float64

or the sum.

In [50]:
df.sum()

season                                                      1028
episode                                                     2233
title          PilotDiversity DayHealth CareThe AllianceBaske...
imdb_rating                                               1552.4
total_votes                                               399810
air_date       2005-03-242005-03-292005-04-052005-04-122005-0...
dtype: object
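
Notice that summing a text column joins the strings together, which is rarely what you want. A minimal sketch on a made-up frame (`demo` is illustrative) showing how `numeric_only=True` skips the text columns:

```python
import pandas as pd

# Hypothetical two-row frame with one text column.
demo = pd.DataFrame({
    "title": ["Pilot", "Diversity Day"],
    "imdb_rating": [7.6, 8.3],
    "total_votes": [3706, 3566],
})

all_sums = demo.sum()                       # the title strings are concatenated
numeric_sums = demo.sum(numeric_only=True)  # text columns are left out

print(all_sums)
print(numeric_sums)
```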

Calculating these statistics for specific columns is straightforward.

In [63]:
df['imdb_rating'].mean()

8.257446808510643

In [52]:
df['total_votes'].sum()

399810

## Selecting subsets

We can even select more than one column!

In [68]:
df[['imdb_rating', 'total_votes']].mean()

imdb_rating       8.257447
total_votes    2126.648936
dtype: float64

Two sets of square brackets are needed because you are passing a list of column names to the getitem dunder method of the pandas dataframe object (thanks to [this stackoverflow question](https://stackoverflow.com/questions/11285613/selecting-multiple-columns-in-a-pandas-dataframe)). You can also check out the pandas documentation on [indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics).
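
The kind of key you pass also decides what you get back. As a minimal sketch on a made-up frame (`demo` is illustrative): a single string returns a Series, while a list of names returns a DataFrame, even if the list has only one entry.

```python
import pandas as pd

# Hypothetical two-row frame.
demo = pd.DataFrame({"imdb_rating": [7.6, 8.3], "total_votes": [3706, 3566]})

one_column = demo["imdb_rating"]                    # string key -> Series
as_frame = demo[["imdb_rating"]]                    # list key -> DataFrame
two_columns = demo[["imdb_rating", "total_votes"]]  # list key -> DataFrame

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(two_columns.shape)
```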

You can also select by row and column position using the iloc indexer, giving [row, column]. Positions start at 0, so the following picks the value in the 3rd row and 5th column (total_votes).

In [69]:
df.iloc[2,4]

2983
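
`iloc` selects by integer position. Its label-based counterpart is `loc`, which takes index labels and column names instead. A minimal sketch on a made-up frame with letter labels (`demo` and its index are illustrative):

```python
import pandas as pd

# Hypothetical frame with a non-numeric index so the two indexers differ visibly.
demo = pd.DataFrame(
    {"title": ["Pilot", "Diversity Day", "Health Care"],
     "total_votes": [3706, 3566, 2983]},
    index=["a", "b", "c"],
)

by_position = demo.iloc[2, 1]            # third row, second column
by_label = demo.loc["c", "total_votes"]  # row labelled "c", column by name

print(by_position, by_label)
```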

All the rows or all the columns are indicated by :. Such as,

In [70]:
df.iloc[:,2]

0                 Pilot
1         Diversity Day
2           Health Care
3          The Alliance
4            Basketball
             ...       
183       Stairmageddon
184      Paper Airplane
185    Livin' the Dream
186            A.A.R.M.
187              Finale
Name: title, Length: 188, dtype: object

In [71]:
df.iloc[2,:]

season                   1
episode                  3
title          Health Care
imdb_rating            7.9
total_votes           2983
air_date        2005-04-05
Name: 2, dtype: object

We can use negative values in indexes to indicate 'from the end'. So, an index of [-10, :] returns the 10th from last row.

In [72]:
df.iloc[-10,:]

season                  9
episode                14
title           Vandalism
imdb_rating           7.6
total_votes          1402
air_date       2013-01-31
Name: 178, dtype: object

Instead of using tail, we could ask for the last 5 rows with an index of [-5:, :]. I read : as 'and everything else' in these cases.

In [74]:
df.iloc[-5:,:]

Unnamed: 0,season,episode,title,imdb_rating,total_votes,air_date
183,9,19,Stairmageddon,8.0,1484,2013-04-11
184,9,20,Paper Airplane,8.0,1482,2013-04-25
185,9,21,Livin' the Dream,8.9,2041,2013-05-02
186,9,22,A.A.R.M.,9.3,2860,2013-05-09
187,9,23,Finale,9.7,7934,2013-05-16


In [75]:
df.tail()

Unnamed: 0,season,episode,title,imdb_rating,total_votes,air_date
183,9,19,Stairmageddon,8.0,1484,2013-04-11
184,9,20,Paper Airplane,8.0,1482,2013-04-25
185,9,21,Livin' the Dream,8.9,2041,2013-05-02
186,9,22,A.A.R.M.,9.3,2860,2013-05-09
187,9,23,Finale,9.7,7934,2013-05-16
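
The equivalence between a negative slice and `tail` can be checked directly. A minimal sketch on a made-up twenty-row frame (`demo` is illustrative):

```python
import pandas as pd

# Hypothetical frame with episodes numbered 1 to 20.
demo = pd.DataFrame({"episode": range(1, 21)})

last_five = demo.iloc[-5:, :]           # slice: the last five rows
same_as_tail = last_five.equals(demo.tail())

tenth_from_last = demo.iloc[-10, 0]     # single row: counts back from the end

print(same_as_tail)
print(tenth_from_last)
```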


Note that the row index is shown on the left. That will stop you getting lost in slices of the data.

For the top ten rows

In [76]:
df.iloc[:10,:]

Unnamed: 0,season,episode,title,imdb_rating,total_votes,air_date
0,1,1,Pilot,7.6,3706,2005-03-24
1,1,2,Diversity Day,8.3,3566,2005-03-29
2,1,3,Health Care,7.9,2983,2005-04-05
3,1,4,The Alliance,8.1,2886,2005-04-12
4,1,5,Basketball,8.4,3179,2005-04-19
5,1,6,Hot Girl,7.8,2852,2005-04-26
6,2,1,The Dundies,8.7,3213,2005-09-20
7,2,2,Sexual Harassment,8.2,2736,2005-09-27
8,2,3,Office Olympics,8.4,2742,2005-10-04
9,2,4,The Fire,8.4,2713,2005-10-11


Of course, we can run methods on these slices. We could, if we wanted to, calculate the mean imdb rating of the first 100 episodes and of the last 100 episodes. _Note_ the indexing starts at 0 so we want the column index of 3 (0:season, 1:episode, 2:title, 3:imdb_rating).

In [77]:
df.iloc[:100,3].mean()

8.483000000000006

In [61]:
df.iloc[-100:,3].mean()

8.062000000000001

If you are unsure how many rows you have then the count method comes to the rescue.

In [62]:
df.iloc[-100:,3].count()

100
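
One thing to keep in mind is that `count` only counts non-missing values, whereas `len` counts every row. The ratings data has no missing values, so the two agree there. A minimal sketch with a made-up Series containing one gap:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing value.
ratings = pd.Series([8.2, np.nan, 7.3, 8.1])

n_rows = len(ratings)        # counts every entry, missing or not
n_values = ratings.count()   # counts only the non-missing entries

print(n_rows, n_values)
```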

In [81]:
df.describe()

Unnamed: 0,season,episode,imdb_rating,total_votes
count,188.0,188.0,188.0,188.0
mean,5.468085,11.87766,8.257447,2126.648936
std,2.386245,7.024855,0.538067,787.098275
min,1.0,1.0,6.7,1393.0
25%,3.0,6.0,7.9,1631.5
50%,6.0,11.5,8.2,1952.5
75%,7.25,18.0,8.6,2379.0
max,9.0,26.0,9.7,7934.0


So it looks like the last 100 episodes were less good than the first 100. I guess that is why it was cancelled.

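Our data is organised by season. Looking at the average by season might help. Passing numeric_only=True restricts the mean to the numeric columns, which recent versions of pandas require when text columns such as title are present.

In [63]:
df.groupby('season').mean(numeric_only=True)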

season,episode,imdb_rating,total_votes
1,3.5,8.016667,3195.333333
2,11.5,8.436364,2630.636364
3,12.0,8.573913,2443.173913
4,7.5,8.6,2422.571429
5,13.5,8.492308,2150.730769
6,13.5,8.219231,1856.538462
7,12.5,8.316667,2030.958333
8,12.5,7.666667,1546.375
9,12.0,7.956522,1852.608696


The above line groups our dataframe by values in the season column and then displays the mean for each group. Pretty nifty.
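
Grouping is not limited to the mean. As a sketch, the `agg` method can compute several statistics per group in one go. The frame below is made up (`demo` is illustrative), with ratings borrowed from the first rows shown earlier.

```python
import pandas as pd

# Hypothetical five-row frame covering two seasons.
demo = pd.DataFrame({
    "season": [1, 1, 2, 2, 2],
    "imdb_rating": [7.6, 8.3, 8.7, 8.2, 8.4],
})

# Select one column after grouping, then ask for several statistics at once.
summary = demo.groupby("season")["imdb_rating"].agg(["mean", "min", "max", "count"])

print(summary)
```

Including the count alongside the mean makes it easier to judge whether a season's average rests on many episodes or just a few.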

Season 8 looks pretty bad. We can look at just the rows for season 8.

In [84]:
df[df['season'] == 8]

Unnamed: 0,season,episode,title,imdb_rating,total_votes,air_date
141,8,1,The List,8.2,1829,2011-09-22
142,8,2,The Incentive,8.2,1668,2011-09-29
143,8,3,Lotto,7.3,1601,2011-10-06
144,8,4,Garden Party,8.1,1717,2011-10-13
145,8,5,Spooked,7.6,1543,2011-10-27
146,8,6,Doomsday,7.8,1476,2011-11-03
147,8,7,Pam's Replacement,7.7,1563,2011-11-10
148,8,8,Gettysburg,7.0,1584,2011-11-17
149,8,9,Mrs. California,7.7,1553,2011-12-01
150,8,10,Christmas Wishes,8.0,1547,2011-12-08


In [82]:
df['season'] == 8

0      False
1      False
2      False
3      False
4      False
       ...  
183    False
184    False
185    False
186    False
187    False
Name: season, Length: 188, dtype: bool

Generally pretty bad, but there is clearly one very disliked episode.
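
The comparison `df['season'] == 8` produces a Series of True/False values, and passing that Series inside square brackets keeps only the rows marked True. The sketch below uses four rows copied from the tables above (so `demo` is only a hand-picked subset) and also shows how conditions can be combined.

```python
import pandas as pd

# Four rows copied from the outputs above; not the full dataset.
demo = pd.DataFrame({
    "season": [8, 8, 8, 9],
    "title": ["The List", "Lotto", "Gettysburg", "Finale"],
    "imdb_rating": [8.2, 7.3, 7.0, 9.7],
})

is_season_8 = demo["season"] == 8   # boolean mask, one value per row
season_8 = demo[is_season_8]        # rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses.
low_rated = demo[(demo["season"] == 8) & (demo["imdb_rating"] < 7.5)]

# idxmin returns the index label of the lowest value in this subset.
lowest_title = season_8.loc[season_8["imdb_rating"].idxmin(), "title"]

print(low_rated)
print(lowest_title)
```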

## Adding columns

We can add new columns pretty simply.

In [85]:
df['x'] = 44
df.head()

Unnamed: 0,season,episode,title,imdb_rating,total_votes,air_date,x
0,1,1,Pilot,7.6,3706,2005-03-24,44
1,1,2,Diversity Day,8.3,3566,2005-03-29,44
2,1,3,Health Care,7.9,2983,2005-04-05,44
3,1,4,The Alliance,8.1,2886,2005-04-12,44
4,1,5,Basketball,8.4,3179,2005-04-19,44


Our new column can be an operation on other columns

In [87]:
df['rating_div_total_votes'] = df['imdb_rating'] / df['total_votes']
df.head()

Unnamed: 0,season,episode,title,imdb_rating,total_votes,air_date,x,rating_div_total_votes
0,1,1,Pilot,7.6,3706,2005-03-24,44,0.002051
1,1,2,Diversity Day,8.3,3566,2005-03-29,44,0.002328
2,1,3,Health Care,7.9,2983,2005-04-05,44,0.002648
3,1,4,The Alliance,8.1,2886,2005-04-12,44,0.002807
4,1,5,Basketball,8.4,3179,2005-04-19,44,0.002642


or as simple as adding one to every value.

In [88]:
df['y'] = df['season'] + 1
df.iloc[0:5,:]

Unnamed: 0,season,episode,title,imdb_rating,total_votes,air_date,x,rating_div_total_votes,y
0,1,1,Pilot,7.6,3706,2005-03-24,44,0.002051,2
1,1,2,Diversity Day,8.3,3566,2005-03-29,44,0.002328,2
2,1,3,Health Care,7.9,2983,2005-04-05,44,0.002648,2
3,1,4,The Alliance,8.1,2886,2005-04-12,44,0.002807,2
4,1,5,Basketball,8.4,3179,2005-04-19,44,0.002642,2


In [89]:
y = df['season'] + 1
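
Assigning with `df['name'] = ...` changes the dataframe in place. If you would rather leave the original untouched, one option is the `assign` method, which returns a new dataframe, and `drop` can remove columns you no longer need. A minimal sketch on a made-up frame (`demo` is illustrative):

```python
import pandas as pd

# Hypothetical two-row frame using values from the first rows shown above.
demo = pd.DataFrame({"imdb_rating": [7.6, 8.3], "total_votes": [3706, 3566]})

# assign returns a copy with the extra column; demo itself is unchanged.
with_ratio = demo.assign(
    rating_div_total_votes=demo["imdb_rating"] / demo["total_votes"]
)

# drop also returns a new frame, here without the helper column.
trimmed = with_ratio.drop(columns=["rating_div_total_votes"])

print(with_ratio)
print(list(demo.columns))
```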

## Writing data

Pandas supports writing out data frames to various formats.

In [68]:
?df.to_csv
df.to_csv('my_output_ratings.csv', encoding='UTF-8')

In [69]:
?df.to_excel
# to_excel does not accept an encoding argument in pandas 2.x; writing .xlsx files needs the openpyxl package
df.to_excel('my_output_ratings.xlsx')
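
By default `to_csv` also writes the row index as an extra first column, which then reappears as an unnamed column when the file is read back. A minimal sketch of a round trip that avoids this with `index=False`; the frame and the file name `demo_ratings.csv` are made up, and the file is written to a temporary folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame.
demo = pd.DataFrame({"title": ["Pilot", "Diversity Day"], "imdb_rating": [7.6, 8.3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_ratings.csv")
    # index=False leaves the row index out of the file.
    demo.to_csv(path, index=False, encoding="UTF-8")
    restored = pd.read_csv(path, encoding="UTF-8")

print(restored)
```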