# Lesson 2 | Basic Pandas


### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Import data with pandas.
- Run basic data exploration analysis.
- Change data types and find outliers.
- Filter pandas dataframes.
- Create simple reports to summarize data.

# Table of Contents
1. [What is Pandas](#whatis)
2. [Import data using Pandas](#import)
3. [Pandas Data Frame & Basic EDA](#basiceda)
4. [Filtering Data Frame](#filtdata)
5. [Adding new column or Row](#newcol)
6. [Groupby and agg](#groupby)
7. [Merge](#merge)

# 1. What is Pandas?<a class="anchor" id="whatis"></a>

## These are pandas:
![SegmentLocal](Images/pandas.gif "segment")

## But not the one we will learn today!

## Pandas ~ Excel
I would say that pandas is the closest thing to an Excel table that you can get in Python.

It is an open source library written in Python for data manipulation and analysis. Its initial release was in 2008, and it was originally developed out of the need for a high-performance tool to run quantitative analysis on financial data. Pandas is built on top of a Python library called NumPy. NumPy supports large multi-dimensional arrays and matrices, and it is very popular in the scientific community.
![title](./Images/pandas_description.png)

source: https://pandas.pydata.org/, https://en.wikipedia.org/wiki/Pandas_(software)

# 2. Import data using Pandas <a class="anchor" id="import"></a>
- In order to use the pandas package we need to import it: `import pandas`
- We add `as pd` after the import so that the alias `pd` refers to the pandas package. This way we can access all the functions and classes of the package by typing `pd` instead of `pandas`.

In [1]:
import pandas as pd

## Works with multiple file types:
#### Excel, CSV, JSON, Pickle, TXT, XML, HTML, Parquet

-  Here are some examples:
- `pd.read_csv(file_path + 'file_name.csv', sep=',')`

- `pd.read_excel(file_path + 'file_name.xls', sheet_name='sheet_name')`

- `pd.read_sql('select * from my_table', connection)`

If the file is saved in the same folder as the code you are running you don't need to add the file path.
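
To try `pd.read_csv` without a file on disk, here is a minimal sketch that reads CSV text from memory. The text, the column values, and the name `toy_df` are made up for illustration; `sep=';'` shows how to handle a separator other than the default comma.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk; note the ';' separator.
csv_text = "show_id;type;release_year\n1;Movie;2019\n2;TV Show;2016\n"

# pd.read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text), sep=';')
print(toy_df)
```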

##### Relative path vs absolute path
 - I recommend using a relative path. An absolute path points to a folder on one specific computer, so the code breaks when it runs on a different machine.

In [2]:
# Absolute path: WILL NOT WORK on your machine, because this folder only exists on the author's computer
abs_path = 'C:/Users/berku/Desktop/DSI/Personal Git Hub/python_lessons/Class 2/data/'

# The file can be imported with the absolute or relative path
df = pd.read_csv(abs_path + 'netflix_titles.csv')

In [3]:
# Relative path WILL WORK!!
# The file can be imported with the absolute or relative path
rel_path = './data/'
df = pd.read_csv(rel_path + 'netflix_titles.csv')
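
As an alternative to joining strings with `+`, paths can be built with the standard library's `pathlib`. This is only a sketch: the names `data_dir`, `csv_path` and `abs_csv_path` are illustrative, and the code only builds the paths without checking that the file exists.

```python
from pathlib import Path

# Relative path: interpreted from the folder the notebook is running in.
data_dir = Path('data')
csv_path = data_dir / 'netflix_titles.csv'

# The same location written as an absolute path on the current machine.
abs_csv_path = Path.cwd() / csv_path

print(csv_path, csv_path.is_absolute())
print(abs_csv_path, abs_csv_path.is_absolute())
```

A `Path` object can be passed directly to `pd.read_csv`.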

#### Common arguments:
- **usecols** => If you don't need all the columns in the dataframe.
- **dtype** => Change data types.
- **parse_dates** => Read column as date time.

In [4]:
# Usecols:
df = pd.read_csv(rel_path + 'netflix_titles.csv', 
                 usecols=['show_id','type','date_added', 'release_year']
                )
df.head(1)

Unnamed: 0,show_id,type,date_added,release_year
0,81145628,Movie,"September 9, 2019",2019


In [5]:
# dtype and parse_dates and Usecols
df = pd.read_csv(rel_path + 'netflix_titles.csv', 
                 usecols=['show_id','type','date_added', 'release_year'], 
                 dtype={'show_id': int, 'type': str, 'release_year': float}, 
                 parse_dates=['date_added'])
df.head(1)

Unnamed: 0,show_id,type,date_added,release_year
0,81145628,Movie,2019-09-09,2019.0


#### Import we will use for this project

In [6]:
rel_path = './data/'
df = pd.read_csv(rel_path + 'netflix_titles.csv', 
                 parse_dates=['date_added'])

# 3. Pandas Data Frame & Basic EDA <a class="anchor" id="basiceda"></a>

### Who loves movies?
#### We will explore a dataset of movies and TV shows that were listed on Netflix.

<img src="./Images/netflix2.jpg" alt="Drawing" style="width: 200px;"/>

### Pandas Data Frame
A pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

**Structure:**
![title](./Images/pandas_dataframe.png)
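
The structure above can be seen on a small hand-made table. In this sketch the data and the name `toy_df` are made up; it builds a DataFrame from a dictionary and prints its row index, its column labels, and the data type of each column.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
toy_df = pd.DataFrame({
    'title': ['Show A', 'Show B', 'Show C'],
    'release_year': [2019, 2016, 2013],
    'is_movie': [True, False, True],
})

print(toy_df.index)    # row labels (0, 1, 2 by default)
print(toy_df.columns)  # column labels
print(toy_df.dtypes)   # one data type per column
```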


## Use .head(), .sample() or .tail() to look at part of the data

In [7]:
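# Compare the three methods. A cell only displays the result of its last line,
# so run each one in its own cell (the output below comes from df.sample())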
df.head()
df.tail()
df.sample()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
1822,80208082,Movie,Kill Hitler! The Luck of the Devil,Frédéric Tonolli,Jean-Baptiste Marcenac,France,2019-05-01,2015,TV-14,53 min,"Documentaries, International Movies","From politicians to officers, many attempted t..."


### Selecting Data From specific rows / columns
- df[list of column names separated by commas]
- df.loc[row index, column names]
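
A minimal sketch of the `df.loc[row index, column names]` form, using a made-up table (`toy_df` and its values are illustrative):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'title': ['Show A', 'Show B', 'Show C'],
    'director': ['Dir X', 'Dir Y', 'Dir Z'],
    'release_year': [2019, 2016, 2013],
})

# One row label and one column name return a single value.
one_value = toy_df.loc[1, 'title']

# A slice of row labels and a list of column names return a smaller DataFrame.
# With .loc the end of the slice (label 1) is included.
subset = toy_df.loc[0:1, ['title', 'release_year']]

print(one_value)
print(subset)
```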

In [8]:
# Select a single column
df['release_year']

0       2019
1       2016
2       2013
3       2016
4       2017
        ... 
6229    2015
6230    2016
6231    2016
6232    2013
6233    2003
Name: release_year, Length: 6234, dtype: int64

In [9]:
# select multiple columns
df[['show_id', 'type', 'title', 'director']].head(1)

Unnamed: 0,show_id,type,title,director
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby"


### EDA = exploratory data analysis

EDA is about getting to know your data and finding its main characteristics.
When working with a new data set, I would recommend running at least the methods below:

- **shape**<br />
Returns a tuple with number of rows and columns of the data frame.
- **dtypes**<br />
Are the data types what we expect them to be?<br />
- **describe**<br />
Creates basic statistics. By default it calculates them only for numeric columns.
- **value_counts**<br />
For a given column counts the occurrence of values.
- **info** gives you non null counts for each column and memory usage<br />
- **sort_values** to sort column values ascending or descending.<br />
- **.isnull().sum()** sum of null rows for each column
- **nunique() and unique()** Count or visualize unique values for a column.
- **drop_duplicates()** Drops duplicates of a subset of columns
- **sum, mean, min, max, mode, std, median**

### Run the examples below:

In [None]:
df.shape

In [None]:
df.head(2)

In [None]:
df.dtypes

**3.1 Practice** Are the column data types what we expect them to be?
- Yes
- No

In [None]:
# Will run it on only numeric columns
df.describe() 

In [None]:
# Will run it on only object columns
df.describe(include='O') 

In [None]:
# Will run it on only date columns
df.describe(include='datetime64[ns]') 

In [None]:
df['release_year'].sort_values(ascending=True)

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.columns

In [None]:
print('Number of unique country combinations is:', df['country'].nunique())
df['country'].unique()

In [19]:
df['type'].drop_duplicates()

0      Movie
2    TV Show
Name: type, dtype: object

In [20]:
df['release_year'].max()
df['release_year'].median()

2016.0
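
`value_counts` is listed above but not run in the cells. Here is a minimal sketch on a made-up Series (the values and the name `types` are illustrative) showing the default counts, the share of each value, and how to include missing values:

```python
import pandas as pd

types = pd.Series(['Movie', 'TV Show', 'Movie', 'Movie', None], name='type')

# Counts per value, most frequent first; missing values are left out by default.
counts = types.value_counts()

# Share of each value instead of the raw count.
shares = types.value_counts(normalize=True)

# Keep missing values as their own category.
counts_with_missing = types.value_counts(dropna=False)

print(counts)
print(shares)
print(counts_with_missing)
```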

**3.2 Practice** Using the describe method answer the following questions:
- On what date was the highest number of movies added?
- What is the oldest release date for a movie or TV show?
- Are there movies with the same description text?

**3.3 Practice** Which 3 directors have the highest number of shows on Netflix?

# 4. Filtering Data <a class="anchor" id="filtdata"></a> 
To understand filtering, we first need to learn about:
 - Pandas Series<br />
   Similar to a column in Excel, a Series is a one-dimensional array with a single data type.   
 - Using a boolean mask on a pandas Series<br />
   We can create a boolean list with the same length as the Series and use it to filter the Series 
   with the [] accessor.
 - Using a boolean mask on a pandas DataFrame<br />
   The same idea applies to a DataFrame: a mask with one value per row keeps only the rows marked True.

In [None]:
my_series = df['release_year'].head()

print(type(df['release_year']))
print('----------------------')
print(my_series)

In [None]:
# We can create a boolean list with the same length as the Series to filter it.
bool_mask = [True, False, False, False, True]

# Using the [] accessor, keep only the positions where the mask is True
my_series[bool_mask]

In [None]:
my_series>2016

In [None]:
my_series[my_series>2016] # Yields the same results.

In [None]:
# The same can be done in the dataframe, 
# First we create a boolean mask based on a column for example
my_mask = df['release_year'] > 2016
print(my_mask)

In [None]:
# Second, use the [] accessor to filter the data frame
df[my_mask].head(2)

In [None]:
# filtering data frame with multiple conditions
df[(df['country']=='United States')
    & (df['release_year'] == 1988)]

In [None]:
# filtering data frame with a text condition: descriptions that contain 'panda'
df[(df['description'].str.contains('panda'))]

In [None]:
# filtering data frame with multiple conditions
df[(df['title'].str.len() < 10)
| (df['release_year'] < df['release_year'].mean())]
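
The operators used in the cells above can be compared side by side on a made-up table (`toy_df` and its values are illustrative). `&` means AND, `|` means OR, `~` negates a mask, and `isin` tests membership in a list. Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `==` or `<` in Python.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'country': ['Spain', 'France', 'Spain', 'Brazil'],
    'release_year': [2015, 2018, 2019, 2010],
})

# Masks can be stored in variables and then combined.
recent = toy_df['release_year'] >= 2018
in_spain = toy_df['country'] == 'Spain'

both = toy_df[recent & in_spain]     # AND: both conditions are True
either = toy_df[recent | in_spain]   # OR: at least one condition is True
not_spain = toy_df[~in_spain]        # NOT: invert the mask
listed = toy_df[toy_df['country'].isin(['Spain', 'Brazil'])]

print(both)
print(either)
print(not_spain)
print(listed)
```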

**4.1 Practice** 
- Filter and order the 5 most recent TV Shows released in Spain.
- Using `value_counts` and filtering the data frame, display the top 10 values of `release_year` with the most movies or TV shows that contain the word "love" in their description.

# 5. Adding new column or Row <a class="anchor" id="newcol"></a> 

We can create a new column based on an existing one, or from a list whose length equals the number of rows in the data frame.


In [None]:
df['release_year_plus1'] = df['release_year'] + 1
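
A minimal sketch of the other two cases on a made-up table (`toy_df`, `new_row` and the values are illustrative): a new column from a list, and a new row. Appending the row with `pd.concat` is one common approach and an assumption here, not necessarily the method intended for this lesson.

```python
import pandas as pd

toy_df = pd.DataFrame({'title': ['A', 'B'], 'release_year': [2015, 2018]})

# New column from a list: the list needs one value per row.
toy_df['rating'] = ['PG', 'TV-14']

# New row: build a one-row DataFrame with the same columns and concatenate.
new_row = pd.DataFrame({'title': ['C'], 'release_year': [2020], 'rating': ['R']})
toy_df = pd.concat([toy_df, new_row], ignore_index=True)  # renumber the index

print(toy_df)
```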

### Deep dive into .loc and basic .iloc

In [None]:
# other way to select multiple columns
# df.loc[:, 'show_id':'director']

# Returns values for all columns regarding row index 2
# df.loc[2]
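
The difference between `.loc` (selects by label) and `.iloc` (selects by integer position) is easier to see when the index is not 0, 1, 2. A minimal sketch on a made-up table (`toy_df` and the index values 10, 20, 30 are illustrative):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'title': ['A', 'B', 'C'], 'release_year': [2015, 2018, 2019]},
    index=[10, 20, 30],
)

by_label = toy_df.loc[20, 'title']   # row labelled 20, column 'title'
by_position = toy_df.iloc[1, 0]      # second row, first column

label_slice = toy_df.loc[10:20]      # label slice: the end label is included
position_slice = toy_df.iloc[0:1]    # position slice: the end is excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```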

# 6. Groupby and agg  <a class="anchor" id="groupby"></a> 
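
A minimal sketch of `groupby` followed by `agg`, assuming a made-up table (`toy_df`, `report` and the values are illustrative). It groups the rows by `type` and computes several summary statistics of `release_year` for each group, which is one way to build a simple report.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'Movie', 'Movie'],
    'release_year': [2015, 2018, 2019, 2017],
})

# One row per group, one column per aggregation.
report = toy_df.groupby('type')['release_year'].agg(['count', 'min', 'max'])
print(report)
```
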
# 7. Merge DataFrames <a class="anchor" id="merge"></a> 
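
A minimal sketch of merging two DataFrames on a shared key, assuming two made-up tables (`shows`, `ratings` and their values are illustrative). `how='inner'` keeps only keys found in both tables; `how='left'` keeps every row of the left table and fills missing matches with NaN.

```python
import pandas as pd

shows = pd.DataFrame({'show_id': [1, 2, 3], 'title': ['A', 'B', 'C']})
ratings = pd.DataFrame({'show_id': [1, 2, 4], 'score': [7.5, 8.0, 6.1]})

inner = shows.merge(ratings, on='show_id', how='inner')  # show_id 1 and 2
left = shows.merge(ratings, on='show_id', how='left')    # show_id 1, 2 and 3

print(inner)
print(left)
```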

**Practice 1.01**: Run the three cells below individually, then merge the cells and run them again.

In [None]:
sales = 1 + 2

In [None]:
sales

**Practice 6.01:** Write a function that returns True if a number is even and False if it is odd.

## Final Practice

**1. Fizzbuzz**



### 7. Appendix
