
## Pandas deep dive
Let's review some of the key concepts for Pandas and NumPy that we will use during this course. First, let's import our library, Pandas.

In [1]:
import pandas as pd

In [2]:
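def fetch_data():
  # Clone the course repository and copy its .txt and .csv files
  # into the current working directory
  import os, shutil
  cwd = os.getcwd()
  # Remove any previous copy so the clone starts clean
  if os.path.exists("LSAMP_Python_Course2024"):
    shutil.rmtree("LSAMP_Python_Course2024")
  # The leading "!" runs a shell command (works in IPython/Colab only)
  !git clone https://github.com/aliawofford9317/LSAMP_Python_Course2024.git
  for file in os.listdir("LSAMP_Python_Course2024"):
    if file.endswith((".txt",".csv")):
      shutil.copy("LSAMP_Python_Course2024/{}".format(file),cwd)
fetch_data()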

Cloning into 'LSAMP_Python_Course2024'...
remote: Enumerating objects: 174, done.
remote: Counting objects: 100% (104/104), done.
remote: Compressing objects: 100% (101/101), done.
remote: Total 174 (delta 57), reused 1 (delta 1), pack-reused 70
Receiving objects: 100% (174/174), 1.52 MiB | 2.82 MiB/s, done.
Resolving deltas: 100% (93/93), done.


A Pandas Series is a one-dimensional array of indexed data. We can create one from a list, as in this example:

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

We can see in the output that a Series wraps a sequence of values and a sequence of indexes. The values are simply a NumPy array.

In [4]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

And the index is an array-like object of type `pd.Index`.

In [5]:
data.index

RangeIndex(start=0, stop=4, step=1)

We can access data through the associated index

In [6]:
data[1]

0.5

And do data slicing

In [7]:
data[1:3]

1    0.50
2    0.75
dtype: float64

Now, the main difference between a NumPy array and a Series is that the Series index does not have to be an integer. For example, we can use strings for our index.

In [10]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [11]:
data['b']

0.5

We can think of the Series as a specialized dictionary. We can even create a Pandas Series object from a Python dictionary

In [12]:
mass_dict = {'Sun': "1.989 × 10^30 kg",
                   'Mercury': "3.285 × 10^23 kg",
                   'Venus': "4.867 × 10^24 kg",
                   'Earth': "5.972 × 10^24 kg",
                   'Mars': "6.39 × 10^23 kg"}
mass = pd.Series(mass_dict)
mass

Sun        1.989 × 10^30 kg
Mercury    3.285 × 10^23 kg
Venus      4.867 × 10^24 kg
Earth      5.972 × 10^24 kg
Mars        6.39 × 10^23 kg
dtype: object

We can slice our Series by label. Unlike positional slicing, a label slice includes its end label.

In [13]:
mass['Mercury':'Earth']

Mercury    3.285 × 10^23 kg
Venus      4.867 × 10^24 kg
Earth      5.972 × 10^24 kg
dtype: object
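
To make the difference between label and positional slicing concrete, here is a minimal sketch. It assumes the same planet labels as above, but stores the masses as plain numbers for brevity; the variable names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

# Same labels as the `mass` Series above, with values stored as floats
mass = pd.Series({'Sun': 1.989e30, 'Mercury': 3.285e23,
                  'Venus': 4.867e24, 'Earth': 5.972e24, 'Mars': 6.39e23})

by_label = mass['Mercury':'Earth']   # label slice: 'Earth' is included
by_position = mass.iloc[1:3]         # positional slice: position 3 is excluded

print(list(by_label.index))     # ['Mercury', 'Venus', 'Earth']
print(list(by_position.index))  # ['Mercury', 'Venus']
```

Both slices start at the same row, but only the label slice reaches `'Earth'`.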

*The Pandas DataFrame*

The next key object in Pandas is the DataFrame. This object can be considered a generalization of a matrix.

It can be thought of as an ordered sequence of columns that share a row index. Let's create a new Series and then combine it with our earlier Series to create a DataFrame.

In [14]:
grav_dict = {'Sun': "274 m/s²", 'Mercury': "3.7 m/s²", 'Venus': "8.87 m/s²",
             'Earth': "9.807 m/s²", 'Mars': "3.721 m/s²"}
grav = pd.Series(grav_dict)
grav

Sun          274 m/s²
Mercury      3.7 m/s²
Venus       8.87 m/s²
Earth      9.807 m/s²
Mars       3.721 m/s²
dtype: object

In [15]:
# Let's reuse the `mass` Series from the previous explanation
# and create a single DataFrame from both Series
objects = pd.DataFrame({'mass': mass,
                       'grav': grav})
objects

                     mass        grav
Sun      1.989 × 10^30 kg    274 m/s²
Mercury  3.285 × 10^23 kg    3.7 m/s²
Venus    4.867 × 10^24 kg   8.87 m/s²
Earth    5.972 × 10^24 kg  9.807 m/s²
Mars      6.39 × 10^23 kg  3.721 m/s²
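
Because the columns share a row index, Pandas lines the Series up by label rather than by position. The sketch below illustrates this under some assumptions: the values are stored as numbers, and a `'Jupiter'` entry (not part of the lesson data) is added to `grav` only so that the two indexes differ.

```python
import pandas as pd

# Two Series whose labels are in a different order and only partly overlap
mass = pd.Series({'Earth': 5.972e24, 'Mars': 6.39e23})
grav = pd.Series({'Mars': 3.721, 'Earth': 9.807, 'Jupiter': 24.79})

aligned = pd.DataFrame({'mass': mass, 'grav': grav})
print(aligned)

# Values are matched by label, and a label missing from one Series becomes NaN
print(aligned.loc['Mars', 'grav'])
print(aligned.loc['Jupiter', 'mass'])
```

This is why building a DataFrame from Series is safe even when the Series were created in a different order.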


We can access the row labels through the `index` attribute.

In [16]:
objects.index

Index(['Sun', 'Mercury', 'Venus', 'Earth', 'Mars'], dtype='object')

In [17]:
# read the Dataframe columns
objects.columns

Index(['mass', 'grav'], dtype='object')

In [18]:
# Access one of the columns, similar to a dictionary
objects['grav']

Sun          274 m/s²
Mercury      3.7 m/s²
Venus       8.87 m/s²
Earth      9.807 m/s²
Mars       3.721 m/s²
Name: grav, dtype: object

Notice that indexing a DataFrame with a name selects a *column*, not a row.

In [19]:
objects['mass']

Sun        1.989 × 10^30 kg
Mercury    3.285 × 10^23 kg
Venus      4.867 × 10^24 kg
Earth      5.972 × 10^24 kg
Mars        6.39 × 10^23 kg
Name: mass, dtype: object

You can describe a Pandas Dataframe with `.describe()`

In [20]:
objects.describe()

                    mass      grav
count                  5         5
unique                 5         5
top     1.989 × 10^30 kg  274 m/s²
freq                   1         1
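
The summary above only shows `count`, `unique`, `top` and `freq` because both columns hold text. As a small sketch, assuming we store the same gravity values once as text and once as numbers, we can compare which statistics `describe()` reports in each case.

```python
import pandas as pd

text_col = pd.Series(['274 m/s²', '3.7 m/s²', '8.87 m/s²'])
num_col = pd.Series([274.0, 3.7, 8.87])

# Text columns are summarized by counts and the most frequent value
print(text_col.describe().index.tolist())

# Numeric columns are summarized by mean, spread and quartiles
print(num_col.describe().index.tolist())
```

Storing measurements as numbers, with the unit in the column name, is usually what makes the numeric summary available.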


### Exercise
- Create a new dataframe from the following data
`data = {'ages': [23, 24, 25, 26, 27, 28], 'weight': [120, 130, 150, 160, 180, 190]}`
- You should now have two data columns `ages` and `weight`. Print only the `ages` column.
- Select the row with age = 27 using `.loc[]`

In [8]:
data = {'ages': [23, 24, 25, 26, 27, 28], 'weight': [120, 130, 150, 160, 180, 190]}
df = pd.DataFrame(data)

print("Ages column:")
print(df['ages'])

print("\nRow with age = 27:")
print(df.loc[df['ages'] == 27])


Ages column:
0    23
1    24
2    25
3    26
4    27
5    28
Name: ages, dtype: int64

Row with age = 27:
   ages  weight
4    27     180


Let's open our cereal CSV and try some Pandas functions.
Use `head()` to show the first 5 rows and `tail()` to show the last 5 rows.

In [21]:
cereal = pd.read_csv('cereal.csv', index_col='name')
cereal.head()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


In [22]:
cereal.tail()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
Triples,G,C,110,2,1,250,0.0,21.0,3,60,25,3,1.0,0.75,39.106174
Trix,G,C,110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301
Wheat Chex,R,C,100,3,1,230,3.0,17.0,3,115,25,1,1.0,0.67,49.787445
Wheaties,G,C,100,3,1,200,3.0,17.0,3,110,25,1,1.0,1.0,51.592193
Wheaties Honey Gold,G,C,110,2,1,200,1.0,16.0,8,60,25,1,1.0,0.75,36.187559


In [24]:
cereal.describe()

,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
count,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0,77.0
mean,106.883117,2.545455,1.012987,159.675325,2.151948,14.597403,6.922078,96.077922,28.246753,2.207792,1.02961,0.821039,42.665705
std,19.484119,1.09479,1.006473,83.832295,2.383364,4.278956,4.444885,71.286813,22.342523,0.832524,0.150477,0.232716,14.047289
min,50.0,1.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,1.0,0.5,0.25,18.042851
25%,100.0,2.0,0.0,130.0,1.0,12.0,3.0,40.0,25.0,1.0,1.0,0.67,33.174094
50%,110.0,3.0,1.0,180.0,2.0,14.0,7.0,90.0,25.0,2.0,1.0,0.75,40.400208
75%,110.0,3.0,2.0,210.0,3.0,17.0,11.0,120.0,25.0,3.0,1.0,1.0,50.828392
max,160.0,6.0,5.0,320.0,14.0,23.0,15.0,330.0,100.0,3.0,1.5,1.5,93.704912


In [23]:
cereal.columns

Index(['mfr', 'type', 'calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo',
       'sugars', 'potass', 'vitamins', 'shelf', 'weight', 'cups', 'rating'],
      dtype='object')

In [25]:
# Returns an array of the unique manufacturers
cereal.mfr.unique()

array(['N', 'Q', 'K', 'R', 'G', 'P', 'A'], dtype=object)

You can get information about the dataset with the `info()` method. It reports the number of rows and columns, the non-null count and data type of each column, and the memory usage.

In [26]:
cereal.info()

<class 'pandas.core.frame.DataFrame'>
Index: 77 entries, 100% Bran to Wheaties Honey Gold
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   mfr       77 non-null     object 
 1   type      77 non-null     object 
 2   calories  77 non-null     int64  
 3   protein   77 non-null     int64  
 4   fat       77 non-null     int64  
 5   sodium    77 non-null     int64  
 6   fiber     77 non-null     float64
 7   carbo     77 non-null     float64
 8   sugars    77 non-null     int64  
 9   potass    77 non-null     int64  
 10  vitamins  77 non-null     int64  
 11  shelf     77 non-null     int64  
 12  weight    77 non-null     float64
 13  cups      77 non-null     float64
 14  rating    77 non-null     float64
dtypes: float64(5), int64(8), object(2)
memory usage: 9.6+ KB


You can also output the shape of the dataframe with `shape`

In [27]:
# Should output (77, 15): 77 rows and 15 columns
cereal.shape

(77, 15)

You can use `.loc` to access specific data.
Let's return the cereals that have a protein content of at least 4.0.

In [28]:
cereal.loc[cereal['protein'] >= 4.0]

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
Cheerios,G,C,110,6,2,290,2.0,17.0,1,105,25,1,1.0,1.25,50.764999
Life,Q,C,100,4,2,150,2.0,12.0,6,95,25,2,1.0,0.67,45.328074
Maypo,A,H,100,4,1,0,0.0,16.0,3,95,25,2,1.0,1.0,54.850917
Muesli Raisins; Dates; & Almonds,R,C,150,4,3,95,3.0,16.0,11,170,25,3,1.0,1.0,37.136863
Muesli Raisins; Peaches; & Pecans,R,C,150,4,3,150,3.0,16.0,11,170,25,3,1.0,1.0,34.139765
Quaker Oat Squares,Q,C,100,4,1,135,2.0,14.0,6,110,25,3,1.0,0.5,49.511874
Quaker Oatmeal,Q,H,100,5,2,0,2.7,-1.0,-1,110,0,1,1.0,0.67,50.828392


Return cereals with at least 2 grams of protein and at most 6 grams of sugar.

In [29]:
cereal.loc[(cereal['protein'] >= 2) & (cereal['sugars'] <= 6)]

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
Bran Chex,R,C,90,2,1,200,4.0,15.0,6,125,25,1,1.0,0.67,49.120253
Bran Flakes,P,C,90,3,0,210,5.0,13.0,5,190,25,3,1.0,0.67,53.313813
Cheerios,G,C,110,6,2,290,2.0,17.0,1,105,25,1,1.0,1.25,50.764999
Corn Chex,R,C,110,2,0,280,0.0,22.0,3,25,25,1,1.0,1.0,41.445019
Corn Flakes,K,C,100,2,0,290,1.0,21.0,2,35,25,1,1.0,1.0,45.863324
Cream of Wheat (Quick),N,H,100,3,0,80,1.0,21.0,0,-1,0,2,1.0,1.0,64.533816
Crispix,K,C,110,2,0,220,1.0,21.0,3,30,25,3,1.0,1.0,46.895644


### Optional Exercise 1
- Can you find the average sugar content of the cereals which list the portion `cups` size as 1.0?
- Can you find the highest and lowest calorie content of the previous selection?

### Optional Exercise 2
- How many cereals by manufacturer `G` have a higher calorie content than 100?

In [49]:
df = pd.read_csv('cereal.csv')

portion_cups_1 = df[df['cups'] == 1.0]
average_sugar_content = portion_cups_1['sugars'].mean()

print(f"Average sugar content of cereals with portion cups size as 1.0: {average_sugar_content:.2f} grams")

Average sugar content of cereals with portion cups size as 1.0: 6.40 grams


In [48]:
highest_calorie = portion_cups_1['calories'].max()
lowest_calorie = portion_cups_1['calories'].min()

print(f"Highest calorie content: {highest_calorie} calories")
print(f"Lowest calorie content: {lowest_calorie} calories")

Highest calorie content: 150 calories
Lowest calorie content: 50 calories


In [47]:
manufacturer_G_high_calories = df[(df['mfr'] == 'G') & (df['calories'] > 100)].shape[0]

print(f"Number of cereals by manufacturer G with calorie content > 100: {manufacturer_G_high_calories}")

Number of cereals by manufacturer G with calorie content > 100: 17


### Handling duplicates
Next we will use the `IMDB-Movie-Data.csv` dataset, which contains a list of movies with their genre, a brief description, director, actors, year, runtime, rating, votes, revenue and metascore.

This dataset does not have duplicates initially, but we can artificially create duplicates by appending the dataset to itself.

In [30]:
movies = pd.read_csv("IMDB-Movie-Data.csv", index_col='Title')
movies.head()

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


In [31]:
movies.shape

(1000, 11)

In [32]:
# Now let's duplicate the data
# by concatenating the DataFrame with itself
dup_movies = pd.concat([movies, movies])
dup_movies.head()

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


In [33]:
dup_movies.tail()

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Secret in Their Eyes,996,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,,45.0
Hostel: Part II,997,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
Step Up 2: The Streets,998,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0
Search Party,999,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0
Nine Lives,1000,"Comedy,Family,Fantasy",A stuffy businessman finds himself trapped ins...,Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,12435,19.64,11.0


We now have 2000 movies, 1000 of which are duplicates

In [34]:
dup_movies.shape

(2000, 11)

We can use the `drop_duplicates(inplace=True)` method to remove duplicate rows from the DataFrame we are working with. With `inplace=True` the DataFrame is modified directly, so we don't have to assign the result to a variable.

In [35]:
dup_movies.drop_duplicates(inplace=True)

In [None]:
dup_movies.shape

The `drop_duplicates()` method also has a `keep` argument, which accepts the following values:
- `'first'` (default): drop duplicates except for the first occurrence
- `'last'`: drop duplicates except for the last occurrence
- `False`: drop all duplicates
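
The three `keep` options can be compared on a tiny example. This is only a sketch with made-up rows, where the first and third rows are identical.

```python
import pandas as pd

# Rows 0 and 2 are exact duplicates of each other
df = pd.DataFrame({'title': ['A', 'B', 'A', 'C'],
                   'year': [2014, 2012, 2014, 2016]})

first = df.drop_duplicates()             # keeps row 0, drops row 2
last = df.drop_duplicates(keep='last')   # keeps row 2, drops row 0
none = df.drop_duplicates(keep=False)    # drops both copies of 'A'

print(list(first.index), list(last.index), list(none.index))
```

Without `inplace=True`, each call returns a new DataFrame and leaves `df` unchanged, which is convenient when you want to compare the results side by side.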

### Cleaning columns
We can change the names of our columns to make them easier to work with, for example by removing symbols, spaces and typos.

Let's print the columns of our original movie dataset.

In [36]:
movies.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

We can rename columns with the `rename()` method, passing a dictionary that maps each old name to its new name.

In [37]:
movies.rename(columns={'Runtime (Minutes)' : 'Runtime',
                        'Revenue (Millions)' : 'Revenue'}, inplace=True)
movies.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
       'Rating', 'Votes', 'Revenue', 'Metascore'],
      dtype='object')

In [38]:
movies.Revenue

Title
Guardians of the Galaxy    333.13
Prometheus                 126.46
Split                      138.12
Sing                       270.32
Suicide Squad              325.02
                            ...  
Secret in Their Eyes          NaN
Hostel: Part II             17.54
Step Up 2: The Streets      58.01
Search Party                  NaN
Nine Lives                  19.64
Name: Revenue, Length: 1000, dtype: float64

In [39]:
movies.Runtime

Title
Guardians of the Galaxy    121
Prometheus                 124
Split                      117
Sing                       108
Suicide Squad              123
                          ... 
Secret in Their Eyes       111
Hostel: Part II             94
Step Up 2: The Streets      98
Search Party                93
Nine Lives                  87
Name: Runtime, Length: 1000, dtype: int64

We can also rename by assigning a list of names to the `columns` property. Let's make all of our names lowercase.

In [40]:
movies.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 'rating', 'votes', 'revenue_millions', 'metascore']
movies.columns

Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
       'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')

We can also simplify renaming by using a list comprehension.

In [41]:
movies.columns = [col.upper() for col in movies]
movies.columns

Index(['RANK', 'GENRE', 'DESCRIPTION', 'DIRECTOR', 'ACTORS', 'YEAR', 'RUNTIME',
       'RATING', 'VOTES', 'REVENUE_MILLIONS', 'METASCORE'],
      dtype='object')

In [42]:
movies.columns = [col.lower() for col in movies]
movies.columns

Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
       'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')
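
As an alternative to a list comprehension, the column index offers string methods through `.str`. The sketch below is one possible cleanup, assuming we want lowercase names with every run of spaces or symbols replaced by a single underscore; the empty DataFrame only borrows three of the original column names.

```python
import pandas as pd

# Empty DataFrame with some of the original (uncleaned) column names
df = pd.DataFrame(columns=['Rank', 'Runtime (Minutes)', 'Revenue (Millions)'])

df.columns = (df.columns
                .str.lower()                                   # 'runtime (minutes)'
                .str.replace(r'[^a-z0-9]+', '_', regex=True)   # 'runtime_minutes_'
                .str.strip('_'))                               # 'runtime_minutes'

print(list(df.columns))
```

The same chain works for any number of columns, so there is no need to type each new name by hand.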

### Working with missing values
When working with datasets, we are almost guaranteed to find missing data or null values, usually represented as Python's `None` or NumPy's `np.nan`. We have two options:
- Get rid of rows or columns with null values.
- Replace null values with something else.

First, let's check whether our movie columns contain null values.

`isnull()` returns a DataFrame of the same shape in which each cell is `True` if the original value is null and `False` otherwise.

We can then count the number of nulls in each column by chaining the `sum()` method.

In [43]:
movies.isnull()

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Guardians of the Galaxy,False,False,False,False,False,False,False,False,False,False,False
Prometheus,False,False,False,False,False,False,False,False,False,False,False
Split,False,False,False,False,False,False,False,False,False,False,False
Sing,False,False,False,False,False,False,False,False,False,False,False
Suicide Squad,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
Secret in Their Eyes,False,False,False,False,False,False,False,False,False,True,False
Hostel: Part II,False,False,False,False,False,False,False,False,False,False,False
Step Up 2: The Streets,False,False,False,False,False,False,False,False,False,False,False
Search Party,False,False,False,False,False,False,False,False,False,True,False


In [50]:
movies.isnull().sum()

rank                  0
genre                 0
description           0
director              0
actors                0
year                  0
runtime               0
rating                0
votes                 0
revenue_millions    128
metascore            64
dtype: int64

We can see that our `revenue_millions` and `metascore` columns have null values. To remove every row containing a null value you can use `.dropna()`. To remove every column containing a null value instead, change the axis by running `.dropna(axis=1)`. Remember that the first number in the shape of a DataFrame is the number of rows and the second is the number of columns; this is where axis 0 = rows and axis 1 = columns comes from.

Let's see a dropping example with a temporary DataFrame.

In [51]:
# Copy the original Dataframe into a new variable
drop_movies = movies.copy()
drop_movies.head()

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


In [52]:
drop_movies.shape

(1000, 11)

In [53]:
drop_movies.isnull().sum()

rank                  0
genre                 0
description           0
director              0
actors                0
year                  0
runtime               0
rating                0
votes                 0
revenue_millions    128
metascore            64
dtype: int64

In [54]:
drop_movies.dropna(inplace=True)

In [55]:
drop_movies.shape

(838, 11)

In [56]:
drop_movies.isnull().sum()

rank                0
genre               0
description         0
director            0
actors              0
year                0
runtime             0
rating              0
votes               0
revenue_millions    0
metascore           0
dtype: int64
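
The walkthrough above drops rows. To see how the `axis` argument changes the result, here is a small sketch on made-up data with a single missing value.

```python
import numpy as np
import pandas as pd

# Made-up data: one revenue value is missing
df = pd.DataFrame({'rating': [8.1, 7.0, 7.3],
                   'revenue': [333.13, np.nan, 138.12]})

rows_dropped = df.dropna()        # removes the row with the missing revenue
cols_dropped = df.dropna(axis=1)  # removes the whole 'revenue' column

print(rows_dropped.shape)  # (2, 2)
print(cols_dropped.shape)  # (3, 1)
```

Dropping by column throws away every revenue value because of one gap, so it is usually only worth it when a column is mostly empty.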

It seems like a waste to drop all of those rows just because they were missing some data. Let's replace the missing values with something else in our original DataFrame `movies`.

In [57]:
movies.isnull().sum()

rank                  0
genre                 0
description           0
director              0
actors                0
year                  0
runtime               0
rating                0
votes                 0
revenue_millions    128
metascore            64
dtype: int64

In [58]:
# Select the revenue column
revenue = movies['revenue_millions']

In [59]:
# Series type, not dataframe
revenue

Title
Guardians of the Galaxy    333.13
Prometheus                 126.46
Split                      138.12
Sing                       270.32
Suicide Squad              325.02
                            ...  
Secret in Their Eyes          NaN
Hostel: Part II             17.54
Step Up 2: The Streets      58.01
Search Party                  NaN
Nine Lives                  19.64
Name: revenue_millions, Length: 1000, dtype: float64

In [60]:
# Calculate revenue mean
revenue_mean = revenue.mean()
revenue_mean

82.95637614678898

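Let's fill in our missing values with the `revenue_mean` value using the `.fillna()` method. Assign the result back to the column so that the original **DataFrame** is updated; calling `fillna(..., inplace=True)` on a column that was selected from a DataFrame is not guaranteed to change the DataFrame in recent versions of Pandas.

In [61]:
movies['revenue_millions'] = revenue.fillna(revenue_mean)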

In [62]:
movies.isnull().sum()

rank                 0
genre                0
description          0
director             0
actors               0
year                 0
runtime              0
rating               0
votes                0
revenue_millions     0
metascore           64
dtype: int64
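
When several columns need filling, `fillna()` also accepts a dictionary that maps each column name to its own fill value. The following is a minimal sketch on made-up numbers, assuming we want each column filled with its own mean.

```python
import numpy as np
import pandas as pd

# Made-up data with one gap per column
df = pd.DataFrame({'revenue': [100.0, np.nan, 50.0],
                   'metascore': [76.0, 65.0, np.nan]})

# Each column gets its own fill value; mean() ignores NaN by default
fill_values = {'revenue': df['revenue'].mean(),
               'metascore': df['metascore'].mean()}

# fillna returns a new DataFrame, so we assign it back
df = df.fillna(fill_values)
print(df)
```

Filling with the mean keeps every row, at the cost of making the filled column look slightly less spread out than the real data.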

### Optional exercise
- Fill out the metascore missing values with the mean of metascores.
- Verify that the `movies` DF has no null values.

In [65]:
# Practice on a small sample DataFrame, stored under its own name
# so that the `movies` dataset loaded earlier is left untouched
data = {'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
        'year': [2010, 2011, 2012, 2013, 2014],
        'metascore': [85, 80, None, 75, None]}
sample_movies = pd.DataFrame(data)

mean_metascore = sample_movies['metascore'].mean()

# Assign the filled column back instead of calling fillna(inplace=True) on a selection
sample_movies['metascore'] = sample_movies['metascore'].fillna(mean_metascore)

if sample_movies.isnull().values.any():
    print("There are still null values in the DataFrame.")
else:
    print("The DataFrame has no null values after filling.")

print("\nUpdated DataFrame:")
print(sample_movies)

The DataFrame has no null values after filling.

Updated DataFrame:
     title  year  metascore
0  Movie A  2010       85.0
1  Movie B  2011       80.0
2  Movie C  2012       80.0
3  Movie D  2013       75.0
4  Movie E  2014       80.0


### Understanding your variables
Remember you can get a summary of your data using the `describe()` method. Let's describe `movies`.

In [68]:
movies.describe()

To describe a single column you can select the column and run the method. You can also describe string columns, like `genre`.

In [74]:
movies.genre.describe()

To count the frequency of each value you can use `value_counts()`.

In [70]:
# Show top 10 genres
movies.genre.value_counts().head(10)

We can further describe the relationships between our variables with the `.corr()` method, which shows the correlation between each pair of numeric columns (a bivariate relationship). Positive numbers indicate a positive correlation: one variable goes up when the other goes up. Negative numbers indicate an inverse correlation: one goes up when the other goes down. A value of 1.0 indicates a perfect positive correlation.

In [None]:
# numeric_only=True skips text columns such as genre and director
movies.corr(numeric_only=True)

### Manipulating Dataframes
In the next section we will further explore how to work with DataFrames and Series:
- Selecting columns
- Selecting rows
- Conditional selections

By now we know how to select columns using square brackets.

In [71]:
genre_col = movies['genre']
genre_col

Selecting with a single column name returns a Series. To extract a column as a DataFrame you need to pass a list of column names, in our case a list with a single name.

In [None]:
type(genre_col)

In [None]:
genre_col = movies[['genre']]
type(genre_col)

Since it's just a list, adding another column name is easy.

In [None]:
subset = movies[['genre', 'rating']]
subset.head()
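
The difference between single and double brackets is easy to check on a small example. This sketch uses a two-row made-up DataFrame with the same column names as our selection above.

```python
import pandas as pd

# Made-up rows, only the column names match the movies dataset
df = pd.DataFrame({'genre': ['Horror', 'Comedy'], 'rating': [7.3, 6.2]})

as_series = df['genre']    # one column name      -> Series (one dimension)
as_frame = df[['genre']]   # list of column names -> DataFrame (two dimensions)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The shape makes the distinction visible: the Series has only a length, while the DataFrame has rows and columns even when there is a single column.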

#### Data by rows
For rows we have two options:
- `.loc` - locates by name
- `.iloc` - locates by numerical index

We are still indexed by movie Title so to use `.loc` we give the Title of the movie

In [None]:
prom = movies.loc['Prometheus']
prom

With `.iloc` we would give the numerical index of the movie

In [None]:
movies.iloc[1]

These methods support slicing much like a `list`, with one difference: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position. Let's select multiple rows.

In [None]:
subset = movies.loc['Prometheus':'Sing']
subset

In [None]:
# Or using iloc
movies.iloc[1:4]

#### Conditional selection
What if we want to make a conditional selection? Let's say we want to filter our movies to show only films directed by Ridley Scott, or only films with a rating greater than or equal to 8.0.

Here is an example of the first condition:

In [None]:
condition = (movies['director'] == 'Ridley Scott')
condition.head()

In [None]:
# Keep only the rows that fulfill the condition:
# select the rows of `movies` whose director is Ridley Scott
movies[movies['director'] == 'Ridley Scott']

We can combine conditions using the logical operators `|` ("or") and `&` ("and").

Let's filter to show only movies by Christopher Nolan or Ridley Scott.

In [None]:
movies[(movies['director'] == 'Christopher Nolan') | (movies['director'] == 'Ridley Scott')]

We could also use the `isin()` method to make this more concise

In [None]:
movies[movies['director'].isin(['Christopher Nolan', 'Ridley Scott'])]
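
To confirm that the two filters select the same rows, here is a small sketch. The director names come from the lesson, while the ratings and the `wanted` variable are made up for illustration.

```python
import pandas as pd

# Made-up ratings; only the director names come from the lesson
df = pd.DataFrame({'director': ['Ridley Scott', 'James Gunn',
                                'Christopher Nolan', 'Ridley Scott'],
                   'rating': [7.0, 8.1, 8.6, 8.0]})

wanted = ['Christopher Nolan', 'Ridley Scott']

with_or = df[(df['director'] == 'Christopher Nolan') | (df['director'] == 'Ridley Scott')]
with_isin = df[df['director'].isin(wanted)]

# ~ flips the boolean mask, selecting everyone NOT in the list
others = df[~df['director'].isin(wanted)]

print(with_or.equals(with_isin))
print(others)
```

With `isin()` the list of names can grow without the expression getting longer, and `~` gives the opposite selection for free.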

Can you get all movies with the following conditions?
- Released between 2005 and 2010
- Have a rating above 8.0
- Made below the 25th percentile in revenue

In [None]:
movies[((movies['year'] >= 2005) & (movies['year'] <= 2010)) # get movies between years 2005 and 2010
      & (movies['rating'] > 8.0) # get movies with rating greater than 8.0
      & (movies['revenue_millions'] < movies['revenue_millions'].quantile(0.25))] # get movies with revenue less than 25th percentile

To check this, recall that `.describe()` reported the 25th percentile of revenue as about 17.4. We can get that value directly by calling the `quantile()` method with a float of 0.25.

In [None]:
movies.describe()
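
The link between `describe()` and `quantile()` can be verified on a small Series. This is only a sketch with made-up revenue values chosen so that the 25th percentile is easy to read off.

```python
import pandas as pd

# Five evenly spaced made-up values
revenue = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

q25 = revenue.quantile(0.25)        # 25th percentile
from_describe = revenue.describe()['25%']

print(q25, from_describe)

# Values strictly below the 25th percentile
print(list(revenue[revenue < q25]))
```

Both calls use the same default interpolation, so the `25%` row of `describe()` and `quantile(0.25)` agree.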

### Finally a brief look at functions
We will cover functions in more detail in a future lesson. For now you should know that iterating over a DataFrame as if it were a list, for example with a for loop, is very slow. A more convenient alternative is the `apply()` method, which passes every value of a column through a function and collects the results. Note that `apply()` still calls the function once per value, so it is not vectorized itself.

Vectorization is a style of programming where operations are applied to whole arrays instead of individual elements, which makes it really efficient. Comparisons and arithmetic on a column, such as `movies['rating'] >= 8.0`, are vectorized.

Let's create a function to rate our movies. The following code creates a function that returns "good" if the rating is greater than or equal to 8.0, and "bad" otherwise. This rating will be stored in a new column of our DataFrame.

In [None]:
# Create our function
# Returns "good" if x is greater than or equal to 8.0
# Returns "bad" if x is less than 8.0

# def is the keyword used to define functions. We will see more of this in a future lesson
def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"

In [None]:
movies['rating_category'] = movies['rating'].apply(rating_function)
movies.head()
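
For a simple two-way rule like this one, a vectorized alternative is NumPy's `np.where`, which evaluates the comparison on the whole column at once. The sketch below compares both approaches on a few made-up ratings; `via_apply` and `via_where` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up ratings
ratings = pd.Series([8.1, 7.0, 8.0, 6.2])

def rating_function(x):
    return "good" if x >= 8.0 else "bad"

# apply: calls rating_function once for every value
via_apply = ratings.apply(rating_function)

# np.where: one comparison over the whole array, then picks "good" or "bad"
via_where = pd.Series(np.where(ratings >= 8.0, "good", "bad"), index=ratings.index)

print(list(via_apply))
print(list(via_where))
```

Both give the same labels. `apply()` is the more flexible choice when the rule is too complex to express as array operations.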

In [44]:
# Save a selection to a CSV file; this creates a new file called 'subset.csv'
subset.to_csv('subset.csv')

### Optional Exercise
- Select the movies with ratings between 6.0 and 8.5 released between 2010 and 2016. Save this selection to a variable.
- Select movies with runtime greater than 120 and rating greater or equal to 8.0. Save this selection to a variable.
- Save these two selections to two `.csv` files using the `df.to_csv('name_of_csv.csv')` function. Where 'df' is the name of your DataFrame


In [72]:
data = {'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
        'year': [2010, 2011, 2012, 2013, 2014],
        'rating': [7.5, 8.2, 6.8, 9.0, 7.2],
        'runtime': [110, 130, 115, 140, 105]}
movies = pd.DataFrame(data)

selection1 = movies[(movies['rating'] >= 6.0) & (movies['rating'] <= 8.5) & (movies['year'].between(2010, 2016))]
selection2 = movies[(movies['runtime'] > 120) & (movies['rating'] >= 8.0)]

selection1.to_csv('movies_selection1.csv', index=False)
selection2.to_csv('movies_selection2.csv', index=False)

print("Selections saved to CSV files: movies_selection1.csv and movies_selection2.csv")

Selections saved to CSV files: movies_selection1.csv and movies_selection2.csv
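
A quick way to check what `to_csv` writes is to read the result straight back. This sketch avoids touching the disk by asking `to_csv` for a string (it returns one when no path is given) and reading it with `io.StringIO`; the two-row DataFrame is made up.

```python
import io
import pandas as pd

# Made-up selection
df = pd.DataFrame({'title': ['Movie A', 'Movie B'], 'rating': [7.5, 8.2]})

csv_text = df.to_csv(index=False)               # no path given -> returns the CSV as a string
restored = pd.read_csv(io.StringIO(csv_text))   # read it back as if it were a file

print(csv_text)
print(restored)
```

With `index=False` the row numbers are not written, so the file contains only the `title` and `rating` columns. If the index holds real information, such as movie titles, leave `index=False` out so it is saved too.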
