# Pandas Tutorial

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1200px-Pandas_logo.svg.png" width="30%">

In [None]:
import pandas as pd

## DataFrames and Series

The two main elements in Pandas are `Series` and `DataFrame`.

A `Series` holds the data of a single column (attribute), and a `DataFrame` is a two-dimensional table made up of a collection of `Series`.
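As a minimal sketch of that relationship, the cell below builds two `Series`, joins them into a `DataFrame`, and then pulls one column back out. The names `apples`, `oranges` and `fruit`, and the values, are made up for illustration.

```python
import pandas as pd

# Two Series sharing the default index 0..3 (illustrative values).
apples = pd.Series([3, 2, 0, 1], name='apples')
oranges = pd.Series([0, 3, 7, 2], name='oranges')

# Placing the Series side by side (axis=1) yields a DataFrame,
# with each Series name used as the column name.
fruit = pd.concat([apples, oranges], axis=1)

# Selecting a single column returns a Series again.
column = fruit['apples']

print(type(fruit).__name__)
print(type(column).__name__)
```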

<img src="https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png" width=600px />

### Creating DataFrames from scratch

You can create a `DataFrame` from a plain Python `dict`.

In this example we have a fruit shop that sells apples and oranges. We want one column per fruit and one row per customer purchase.

In [None]:
data = {
    'apples': [3.2, 2, 0, 1], 
    'oranges': [0, 3, 7, 2],
    'tip': ['yes', 'no', 'yes', 'yes']
}

In [None]:
purchases = pd.DataFrame(data)
purchases

The **index** of this DataFrame was created automatically, using the numbers 0 to 3, but we can assign our own labels.

Here the customers' names become the index:

In [None]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

Now we can look up a customer's data by name with `.loc`, or by position with `.iloc`:

In [None]:
purchases.iloc[0].tip

In [None]:
purchases.loc['June']

In [None]:
purchases.iloc[0]
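To make the difference concrete, here is a small sketch that rebuilds the same `purchases` table and fetches one row both ways. The variable names `by_label`, `by_position` and `lily_oranges` are only for illustration.

```python
import pandas as pd

data = {
    'apples': [3.2, 2, 0, 1],
    'oranges': [0, 3, 7, 2],
    'tip': ['yes', 'no', 'yes', 'yes'],
}
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

# .loc looks up the row whose index label is 'Lily'.
by_label = purchases.loc['Lily']

# .iloc looks up the third row, counting from 0.
by_position = purchases.iloc[2]

# .loc also accepts a column label to reach a single cell.
lily_oranges = purchases.loc['Lily', 'oranges']

print(by_label.equals(by_position))
print(lily_oranges)
```

Label-based access keeps working if the rows are reordered, while position-based access depends on the current row order.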

We can also access the data by column:

In [None]:
purchases['oranges']

In [None]:
atribs = ['oranges', 'tip']
purchases[atribs]

In [None]:
purchases.oranges

### Reading from a CSV


In [None]:
df = pd.read_csv('purchases.csv')

df.head()

In [None]:
?pd.read_csv

When reading, we can choose which column becomes the index with `index_col`:

In [None]:
df = pd.read_csv('purchases.csv', index_col=0)

df
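If you do not have `purchases.csv` at hand, the effect of `index_col` can be tried on an in-memory file. This is only a sketch: the CSV text and the `client` column name below are invented, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Stand-in for a CSV file; the contents are invented for this sketch.
csv_text = """client,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
"""

# Without index_col, pandas creates the default 0..n-1 index.
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col accepts a column position or a column name.
named_index = pd.read_csv(io.StringIO(csv_text), index_col='client')

print(default_index.index.tolist())
print(named_index.index.tolist())
```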

## Usual operations with DataFrame

We are going to load a list of IMDB films:

In [None]:
movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")

### Visualizing the Data

We show a few rows with `.head()`:

In [None]:
movies_df.head()

`.head()` shows the first **five** rows by default, but you can ask for another number, e.g. `movies_df.head(10)`.

To see the last **rows** we use `.tail()`.

In [None]:
movies_df.tail(2)

### Getting information from your data

`.info()` should be one of the first methods you call after loading your data.

In [None]:
movies_df.info()

In [None]:
movies_df.shape

`.shape` returns a tuple with the number of rows (instances) and columns.
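Because `.shape` is a plain tuple, it can be unpacked directly. The sketch below uses a small made-up table (`toy`) to show this together with the default sizes of `.head()` and `.tail()`.

```python
import pandas as pd

# A toy table with 10 rows and 2 columns (illustrative values).
toy = pd.DataFrame({'number': range(10), 'letter': list('abcdefghij')})

# .shape is an attribute, not a method: no parentheses.
n_rows, n_cols = toy.shape

first_rows = toy.head()    # five rows by default
last_rows = toy.tail(2)    # explicit number of rows

print(n_rows, n_cols)
print(len(first_rows), len(last_rows))
```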

### Renaming columns

We can rename the columns if we want to.

In [None]:
movies_df.columns

In [None]:
movies_df.rename(columns={
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)


movies_df.columns

In [None]:
movies_df.Runtime

In [None]:
movies_df.head()

Another option is to make all column names lowercase. Instead of calling `rename()`, we assign directly to the `.columns` attribute.

In [None]:
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 
                     'rating', 'votes', 'revenue_millions', 'metascore']


movies_df.columns

But that's too much work. Instead of renaming each column by hand, we can use a list comprehension:

In [None]:
movies_df.columns = [col.lower() for col in movies_df]

movies_df.columns
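Two variations are worth knowing, sketched here on a made-up two-column table (`films` and its values are invented). Without `inplace=True`, `rename()` returns a new DataFrame and leaves the original untouched, and the `.str` accessor on `.columns` gives the same result as the list comprehension.

```python
import pandas as pd

# Toy table reusing two of the original column names (values invented).
films = pd.DataFrame({
    'Runtime (Minutes)': [100, 120],
    'Revenue (Millions)': [10.5, 20.0],
})

# rename() without inplace=True returns a renamed copy.
renamed = films.rename(columns={'Runtime (Minutes)': 'Runtime'})

# Vectorised alternative to [col.lower() for col in df].
lowered = renamed.copy()
lowered.columns = lowered.columns.str.lower()

print(list(films.columns))
print(list(renamed.columns))
print(list(lowered.columns))
```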

### Learning about numerical variables

`describe()` returns a summary of the distribution of all numerical columns:

In [None]:
movies_df.describe()



`.describe()` can also be used on categorical columns.

In [None]:
movies_df['genre'].describe()

In [None]:
movies_df['genre'].value_counts().head(10)
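For a text column, `.describe()` reports the number of values, the number of distinct values, the most frequent value and how often it occurs. A minimal sketch with an invented `genres` column:

```python
import pandas as pd

# Invented genre labels for illustration.
genres = pd.Series(
    ['Drama', 'Comedy', 'Drama', 'Action', 'Drama', 'Comedy'],
    name='genre',
)

# count / unique / top / freq for non-numeric data.
summary = genres.describe()

# Frequency of each distinct value, most common first.
counts = genres.value_counts()

print(summary)
print(counts)
```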

#### Correlation between attributes

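The `.corr()` method computes the pairwise correlation between numerical columns:

In [None]:
# numeric_only=True skips text columns such as genre or director,
# which recent pandas versions would otherwise fail to convert
movies_df.corr(numeric_only=True)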
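The following sketch shows what the correlation matrix looks like on a tiny invented table (`toy`), assuming pandas 1.5 or later, where `numeric_only` is available. Column `y` rises with `x` and column `z` falls with it, so the coefficients come out at the two extremes.

```python
import pandas as pd

# Invented data: y = 2 * x, and z decreases as x increases.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'x': [1, 2, 3, 4],
    'y': [2, 4, 6, 8],
    'z': [4, 3, 2, 1],
})

# Text columns are left out of the matrix.
corr = toy.corr(numeric_only=True)

print(corr)
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate little or none.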

### DataFrame: slicing and filtering

#### By columns


In [None]:
genre_col = movies_df['genre']

type(genre_col)

In [None]:
genre_col = movies_df[['genre']]

type(genre_col)

In [None]:
subset = movies_df[['genre', 'rating']]

subset.head()

#### By rows

 

- `.loc` - selects rows by index label.
- `.iloc` - selects rows by integer position.



In [None]:
prom = movies_df.loc["Guardians of the Galaxy"]

prom

In [None]:
prom = movies_df.iloc[0]
prom

In [None]:
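# .loc slices by label and includes both endpoints
movie_subset = movies_df.loc['Prometheus':'Sing']

# .iloc slices by position and excludes the stop position
movie_subset = movies_df.iloc[1:4]

movie_subset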
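The slicing rules differ between the two accessors, which is easy to trip over. A small sketch on an invented table (`toy`) with letter labels:

```python
import pandas as pd

# Invented table with labels 'a'..'e' at positions 0..4.
toy = pd.DataFrame(
    {'score': [10, 20, 30, 40, 50]},
    index=['a', 'b', 'c', 'd', 'e'],
)

# Label slice: both 'b' and 'd' are included.
by_label = toy.loc['b':'d']

# Position slice: starts at 1 and stops before 4, like a Python list.
by_position = toy.iloc[1:4]

# Stopping at 3 leaves out the row at position 3 ('d').
shorter = toy.iloc[1:3]

print(by_label.index.tolist())
print(by_position.index.tolist())
print(shorter.index.tolist())
```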



#### Conditional Selection


In [None]:
condition = (movies_df.director == "Ridley Scott")

condition.head()

In [None]:
movies_df[condition]

In [None]:
movies_df[movies_df.index == 'La La Land']

In [None]:
movies_df[movies_df['director'] == "Ridley Scott"].head()

In [None]:
movies_df[movies_df['rating'] < 4].sort_values('rating', ascending=True)
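Conditions can also be combined. The sketch below uses an invented `films` table (titles, directors and ratings are made up) to show `&` (and), `|` (or) and `.isin()`. Each condition needs its own parentheses, because `&` and `|` bind more tightly than comparisons in Python.

```python
import pandas as pd

# Invented films for illustration only.
films = pd.DataFrame(
    {
        'director': ['Ann Lee', 'Bo Cruz', 'Ann Lee', 'Cy Diaz'],
        'rating': [8.1, 6.4, 5.9, 7.7],
        'year': [2010, 2014, 2016, 2016],
    },
    index=['Film A', 'Film B', 'Film C', 'Film D'],
)

# AND: both conditions must hold.
both = films[(films['director'] == 'Ann Lee') & (films['rating'] >= 8)]

# OR: at least one condition must hold.
either = films[(films['year'] == 2016) | (films['rating'] > 8)]

# isin: match any value from a list.
listed = films[films['director'].isin(['Bo Cruz', 'Cy Diaz'])]

print(both.index.tolist())
print(either.index.tolist())
print(listed.index.tolist())
```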

In [None]:
help(movies_df.sort_values)

In [None]:
movies_df.rating

# Exercise

Show the directors who have made a Sci-Fi film with a rating greater than or equal to 8.