# Intro to Pandas
So far we've looked at the basic building blocks of Python: variables, data structures, strings, and control flow. 

While we could analyse a text file using only these tools, it would be difficult. Imagine having a file with 10,000 customer reviews: how would you find all the reviews from a specific date or calculate the average rating?

This is where __pandas__ comes in. Pandas is one of the most popular Python libraries for data manipulation and analysis. It gives us a powerful new tool that looks and feels like a spreadsheet or a table, but with all the power of a programming language.

In this notebook, you will learn how to:
- Import the pandas library
- Load a CSV file from the internet into a pandas DataFrame
- Inspect your data to understand its basic properties
- Select specific columns and rows that you are interested in

## Importing Pandas
First, we need to import the library. The standard convention is to import pandas and give it the short alias pd. This saves us typing and is widely understood by Python data analysts.

In [12]:
import pandas as pd

### Loading data into a DataFrame
The core of pandas is the __DataFrame__. It's a two-dimensional table, just like a spreadsheet, with rows and columns.
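
To make the idea of a table with rows and columns concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The column names and values are made up for illustration and are not part of the movie dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "rating": [7.5, 6.0, 8.2],
})

print(toy)          # three rows and two columns, with an index on the left
print(toy.columns)  # the column labels
print(toy.index)    # the row labels (0, 1, 2 by default)
```

Each key of the dictionary becomes a column, and each position in the lists becomes a row.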

One of the best features of Google Colab is that it can easily access data directly from the internet. We're going to load a simple dataset of movie information using a URL.

The command is pd.read_csv(), and we just give it the URL of the raw data file.

In [13]:
# The URL of the dataset we are going to use
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/fandango/fandango_score_comparison.csv'

# Use pandas to read the csv file and store it in a variable called 'movies'
movies = pd.read_csv(url)

With one line of code, we have loaded an entire dataset into our notebook. The movies variable now holds our DataFrame.
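
pd.read_csv() is not limited to URLs. As a rough sketch, the example below reads CSV text held in memory, which is handy for trying things out without a network connection. The csv_text content and the sample variable are assumptions made up for this illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a downloaded file
csv_text = """title,rating
Film A,7.5
Film B,6.0
"""

# read_csv accepts a file-like object as well as a file path or URL
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the text is treated as the header row, so it supplies the column names.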

### Inspecting your data
Now that the data is loaded, the first thing we always do is inspect it to understand what we're working with. Here are the most essential commands. 

#### .head() - Look at the first few rows
This is the best way to get a quick glance at your data and see what the columns look like. By default, it shows the first 5 rows. 

In [14]:
movies.head()

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5
2,Ant-Man (2015),80,90,64,8.1,7.8,5.0,4.5,4.0,4.5,...,3.9,4.0,4.5,3.0,4.0,4.0,627,103660,12055,0.5
3,Do You Believe? (2015),18,84,22,4.7,5.4,5.0,4.5,0.9,4.2,...,2.7,1.0,4.0,1.0,2.5,2.5,31,3136,1793,0.5
4,Hot Tub Time Machine 2 (2015),14,28,29,3.4,5.1,3.5,3.0,0.7,1.4,...,2.55,0.5,1.5,1.5,1.5,2.5,88,19560,1021,0.5


#### .tail() - Look at the last few rows
Similarly, you can look at the end of your data to make sure it loaded correctly.

In [15]:
movies.tail()

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
141,Mr. Holmes (2015),87,78,67,7.9,7.4,4.0,4.0,4.35,3.9,...,3.7,4.5,4.0,3.5,4.0,3.5,33,7367,1348,0.0
142,'71 (2015),97,82,83,7.5,7.2,3.5,3.5,4.85,4.1,...,3.6,5.0,4.0,4.0,4.0,3.5,60,24116,192,0.0
143,"Two Days, One Night (2014)",97,78,89,8.8,7.4,3.5,3.5,4.85,3.9,...,3.7,5.0,4.0,4.5,4.5,3.5,123,24345,118,0.0
144,Gett: The Trial of Viviane Amsalem (2015),100,81,90,7.3,7.8,3.5,3.5,5.0,4.05,...,3.9,5.0,4.0,4.5,3.5,4.0,19,1955,59,0.0
145,"Kumiko, The Treasure Hunter (2015)",87,63,68,6.4,6.7,3.5,3.5,4.35,3.15,...,3.35,4.5,3.0,3.5,3.0,3.5,19,5289,41,0.0
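
Both .head() and .tail() accept a number if you want something other than the default 5 rows. Here is a small sketch on a made-up DataFrame (the numbers variable is hypothetical and not part of the movie data):

```python
import pandas as pd

# Hypothetical toy DataFrame with ten rows, labelled 0 to 9
numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)  # rows 0, 1 and 2
last_two = numbers.tail(2)     # rows 8 and 9

print(first_three)
print(last_two)
```

Notice that the rows returned by .tail() keep their original row labels.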


#### .shape - Check the dimensions
This attribute tells you the size of your DataFrame as a tuple: (number of rows, number of columns). Because it is an attribute rather than a method, it is written without parentheses.

In [16]:
movies.shape

(146, 22)
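
Since .shape is an ordinary Python tuple, we can unpack it into two variables. The sketch below uses a made-up DataFrame; the names toy, n_rows and n_cols are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape  # unpack the tuple into two variables
print(f"{n_rows} rows and {n_cols} columns")
print(len(toy))             # len() gives the number of rows only
```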

#### .info() - Get a technical summary
This method gives you a summary of your DataFrame, including the name of each column, the number of non-missing values, and the data type (Dtype) of each column. This is fantastic for understanding your data's structure.

In [17]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 22 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   FILM                        146 non-null    object 
 1   RottenTomatoes              146 non-null    int64  
 2   RottenTomatoes_User         146 non-null    int64  
 3   Metacritic                  146 non-null    int64  
 4   Metacritic_User             146 non-null    float64
 5   IMDB                        146 non-null    float64
 6   Fandango_Stars              146 non-null    float64
 7   Fandango_Ratingvalue        146 non-null    float64
 8   RT_norm                     146 non-null    float64
 9   RT_user_norm                146 non-null    float64
 10  Metacritic_norm             146 non-null    float64
 11  Metacritic_user_nom         146 non-null    float64
 12  IMDB_norm                   146 non-null    float64
 13  RT_norm_round               146 non
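
If you only need one piece of what .info() reports, pandas also exposes the pieces separately. The following is a minimal sketch on made-up data with one deliberately missing value; toy is a hypothetical DataFrame, not the movie dataset.

```python
import pandas as pd

# Hypothetical toy data with one missing rating
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "rating": [7.5, None, 8.2],
})

print(toy.dtypes)        # the data type of each column
print(toy.count())       # the number of non-missing values per column
print(toy.isna().sum())  # the number of missing values per column
```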

## Selecting Data
Now for the most important part: how do we grab the specific pieces of data we're interested in?

#### Selecting Columns
You can select a single column by putting its name in square brackets []. This will give you a pandas Series, which is what we call a single column.

In [18]:
# Select the 'FILM' column
film_column = movies['FILM']

# Display the first 5 entries of this column
film_column.head()

0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
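
A Series is more than a list of values: it remembers its column name and comes with its own methods. Here is a small sketch on made-up data (toy and ratings_series are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"title": ["Film A", "Film B"], "rating": [7.5, 6.0]})

ratings_series = toy["rating"]
print(type(ratings_series))   # a pandas Series
print(ratings_series.name)    # the column name travels with the Series
print(ratings_series.mean())  # Series have their own methods, such as mean()
```

This is one way to answer the "average rating" question from the start of the notebook.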

To select multiple columns, you pass a list of column names inside the square brackets. Notice the double square brackets: the outer pair does the selecting, and the inner pair is the list.

In [19]:
# Select the 'FILM', 'RottenTomatoes', and 'IMDB' columns
selected_columns = movies[['FILM', 'RottenTomatoes', 'IMDB']]

selected_columns.head()

Unnamed: 0,FILM,RottenTomatoes,IMDB
0,Avengers: Age of Ultron (2015),74,7.8
1,Cinderella (2015),85,7.1
2,Ant-Man (2015),80,7.8
3,Do You Believe? (2015),18,5.4
4,Hot Tub Time Machine 2 (2015),14,5.1
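
The double brackets can look odd at first, so this sketch splits them apart by storing the list in its own variable. It also shows that a list with a single name still gives back a DataFrame, while a bare name gives a Series. All names and values here are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "rating": [7.5, 6.0],
    "year": [2014, 2015],
})

wanted = ["title", "year"]  # an ordinary Python list of column names
subset = toy[wanted]        # same result as toy[["title", "year"]]

one_col_frame = toy[["rating"]]  # a list with one name: still a DataFrame
one_col_series = toy["rating"]   # a bare name: a Series

print(subset)
print(type(one_col_frame), type(one_col_series))
```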


#### Selecting Rows (Filtering)
We can select rows based on conditions, just like we did with if statements.

Let's find all the movies that have an IMDB rating higher than 8. First, let's see what the condition looks like on its own. It produces a Series of True or False values.

In [20]:
# This condition checks each row to see if its 'IMDB' value is greater than 8
movies['IMDB'] > 8

0      False
1      False
2      False
3      False
4      False
       ...  
141    False
142    False
143    False
144    False
145    False
Name: IMDB, Length: 146, dtype: bool
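
With a long Series like this it is hard to tell whether any row is True. Because True counts as 1 and False as 0, summing the Series gives the number of matching rows. Here is a minimal sketch on made-up data (toy and mask are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "rating": [7.5, 8.4, 8.1],
})

mask = toy["rating"] > 8  # one True/False value per row
print(mask.tolist())      # [False, True, True]
print(mask.sum())         # 2, the number of rows where the condition holds
```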

Now, we can use this True/False Series to select only the rows from our DataFrame where the condition is True.

In [21]:
# Put the condition inside the square brackets
highly_rated_movies = movies[movies['IMDB'] > 8]

highly_rated_movies.head()

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
14,The Imitation Game (2014),90,92,73,8.2,8.1,5.0,4.6,4.5,4.6,...,4.05,4.5,4.5,3.5,4.0,4.0,566,334164,8055,0.4
28,Wild Tales (2014),96,92,77,8.8,8.2,4.5,4.1,4.8,4.6,...,4.1,5.0,4.5,4.0,4.5,4.0,107,50285,235,0.4
42,About Elly (2015),97,86,87,9.6,8.2,4.0,3.6,4.85,4.3,...,4.1,5.0,4.5,4.5,5.0,4.0,23,20659,43,0.4
76,Straight Outta Compton (2015),90,94,72,7.3,8.4,5.0,4.8,4.5,4.7,...,4.2,4.5,4.5,3.5,3.5,4.0,90,15982,8096,0.2
86,Me and Earl and The Dying Girl (2015),81,89,74,8.4,8.2,4.5,4.3,4.05,4.45,...,4.1,4.0,4.5,3.5,4.0,4.0,41,5269,624,0.2


Let's try one more. Let's find all the movies with a RottenTomatoes score of less than 20.

In [22]:
# Find all the poorly-rated movies on Rotten Tomatoes
poorly_rated_movies = movies[movies['RottenTomatoes'] < 20]

poorly_rated_movies.head()

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
3,Do You Believe? (2015),18,84,22,4.7,5.4,5.0,4.5,0.9,4.2,...,2.7,1.0,4.0,1.0,2.5,2.5,31,3136,1793,0.5
4,Hot Tub Time Machine 2 (2015),14,28,29,3.4,5.1,3.5,3.0,0.7,1.4,...,2.55,0.5,1.5,1.5,1.5,2.5,88,19560,1021,0.5
15,Taken 3 (2015),9,46,26,4.6,6.1,4.5,4.1,0.45,2.3,...,3.05,0.5,2.5,1.5,2.5,3.0,240,104235,6757,0.4
19,Pixels (2015),17,54,27,5.3,5.6,4.5,4.1,0.85,2.7,...,2.8,1.0,2.5,1.5,2.5,3.0,246,19521,3886,0.4
33,The Boy Next Door (2015),10,35,30,5.5,4.6,4.0,3.6,0.5,1.75,...,2.3,0.5,2.0,1.5,3.0,2.5,75,19658,2800,0.4
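
Filtering rows and selecting columns can be combined: filter first, then pick the columns you want from the result. The sketch below does this on made-up data; the names toy, high and high_titles are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "rating": [7.5, 8.4, 8.1],
    "year": [2014, 2015, 2015],
})

high = toy[toy["rating"] > 8]            # keep rows where the condition is True
high_titles = high[["title", "rating"]]  # then keep only two of the columns

print(high_titles)
print(len(high_titles))  # the number of matching rows
```

As in the movie examples above, the filtered rows keep their original row labels (here 1 and 2) rather than being renumbered from 0.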


## Exercises
Time to practice what you've learned. Use the movies DataFrame to answer the following questions.

#### Exercise 1: Shape and Size
How many rows and columns are in the movies DataFrame?

In [23]:
# Your code for Exercise 1 here

#### Exercise 2: First look
Display the first 8 rows of the DataFrame. (Hint: .head() can take a number as an argument.)

In [24]:
# Your code for Exercise 2 here

#### Exercise 3: Specific Columns
Create a new DataFrame called ratings that contains only the 'FILM', 'IMDB' and 'Metacritic' columns. Display the first 5 rows of this new DataFrame.

In [25]:
# Your code for Exercise 3 here

#### Exercise 4: Filtering
Create a new DataFrame called mc_great_metrics that contains only the movies with a Metacritic score of 90 or higher. Display this new DataFrame.

In [ ]:
# Your code for Exercise 4 here

## Conclusion
You have now had a first introduction to pandas and seen how much it adds to Python. You now know how to:
- Load tabular data with pd.read_csv()
- Inspect a DataFrame with .head(), .tail(), .shape, and .info()
- Select columns with df[['col1', 'col2']]
- Filter rows based on conditions with df[df['col'] > value]

These skills are the foundation for everything that comes next. Now that we can load and select data, our next step will be cleaning and processing the text within our DataFrame's columns.