# Pandas In Python

**By Muhammad Kareem**

### What is pandas?

Pandas is a powerful Python library for data analysis, with features that are essential to almost any data science project.

### Why pandas?

Pandas provides features for data analysis and visualization that take very few lines of code yet give outstanding results. Some of its major features are:

- Sorting out data and creating awesome insights on the data

- Beautiful plots of the data, giving non-programmers a visual view of it

- Indexing and sorting out values in data

- Handling large datasets (millions of entries)

- Cleaning up data and handling missing values

### Installing pandas 

Run `pip install pandas` for a regular installation. If you're using Anaconda, refer to the [Anaconda pandas tutorial](https://docs.anaconda.com/anaconda/navigator/tutorials/pandas/) instead.

### Creating your first dataframe

In this basic example we'll see what dataframes are and how to work with data

In [1]:
# FIRST WE WILL IMPORT PANDAS

import pandas as pd

employees = pd.DataFrame({
    'name':['Joe','Peter','Chris','Quagmire'],
    'job':['Data scientist','Python developer','Java developer','C developer'],
    'age':[26,25,31,29]
})
employees

| | name | job | age |
|---|---|---|---|
| 0 | Joe | Data scientist | 26 |
| 1 | Peter | Python developer | 25 |
| 2 | Chris | Java developer | 31 |
| 3 | Quagmire | C developer | 29 |


To create a dataframe we call pandas' `DataFrame()` constructor. The simplest way to structure the data is a dictionary where the keys are the column names and each value is the list of values for that column
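Once a dataframe exists, it helps to check that it has the size and columns you expect. The sketch below rebuilds the same `employees` data and reads its `shape` and `columns` attributes; the variable names `n_rows`, `n_cols` and `column_names` are only illustrative.

```python
import pandas as pd

# Same data as the example above: each dictionary key becomes a column.
employees = pd.DataFrame({
    'name': ['Joe', 'Peter', 'Chris', 'Quagmire'],
    'job': ['Data scientist', 'Python developer', 'Java developer', 'C developer'],
    'age': [26, 25, 31, 29],
})

# shape is a (rows, columns) tuple; columns holds the column labels.
n_rows, n_cols = employees.shape
column_names = list(employees.columns)

print(n_rows, n_cols)
print(column_names)
```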

### Indexing your data

Now imagine a dataframe of product sales that includes three customers' reviews of each product. We'll build it and then see how to explore the data

In [4]:
sales = pd.DataFrame({
    'Product':['2016 Sales','2017 Sales','2018 Sales','2019 Sales','2020 Sales'],
    'Peter':[92,91,88,95,66],
    'Louis':[84,91,75,83,90],
    'Meg':[96,86,80,88,90],
})
sales

| | Product | Peter | Louis | Meg |
|---|---|---|---|---|
| 0 | 2016 Sales | 92 | 84 | 96 |
| 1 | 2017 Sales | 91 | 91 | 86 |
| 2 | 2018 Sales | 88 | 75 | 80 |
| 3 | 2019 Sales | 95 | 83 | 88 |
| 4 | 2020 Sales | 66 | 90 | 90 |


This looks alright, but the numeric index isn't telling us anything here. Since each row holds the reviews for one product, why not index by the product, so that every remaining column is a customer's review of it? Let's see how to do that

In [7]:
sales = pd.DataFrame({
    'Peter':[92,91,88,95,66],
    'Louis':[84,91,75,83,90],
    'Meg':[96,86,80,88,90]
},index=['2016 Sales','2017 Sales','2018 Sales','2019 Sales','2020 Sales'])
sales

| | Peter | Louis | Meg |
|---|---|---|---|
| 2016 Sales | 92 | 84 | 96 |
| 2017 Sales | 91 | 91 | 86 |
| 2018 Sales | 88 | 75 | 80 |
| 2019 Sales | 95 | 83 | 88 |
| 2020 Sales | 66 | 90 | 90 |


This looks much better and makes more sense. Whenever you want meaningful row labels rather than the default numbers, ***just pass a list of labels to the `index` argument***
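If the dataframe already exists with the labels stored in a regular column, you don't have to rebuild it. A minimal sketch, assuming a shortened version of the sales data, uses `set_index()` to promote the `Product` column to the index; `indexed_sales` is just an illustrative name.

```python
import pandas as pd

# A shortened version of the sales data, with Product as a regular column.
sales = pd.DataFrame({
    'Product': ['2016 Sales', '2017 Sales', '2018 Sales'],
    'Peter': [92, 91, 88],
    'Louis': [84, 91, 75],
})

# set_index returns a new dataframe; `sales` keeps its numeric index.
indexed_sales = sales.set_index('Product')

# Rows can now be looked up by product label.
print(indexed_sales.loc['2017 Sales', 'Peter'])
```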

### Exploring the data

When handling data in pandas we're usually looking for specific slices of the data, or let's say "insights". Let's take another look at our employees dataframe: we might want only the employees over a certain age, or just the jobs column. Let's quickly go over how to inspect and break down our data

In [8]:
# head() returns the first rows of a dataframe (five by default; ours only has four)
employees.head()

| | name | job | age |
|---|---|---|---|
| 0 | Joe | Data scientist | 26 |
| 1 | Peter | Python developer | 25 |
| 2 | Chris | Java developer | 31 |
| 3 | Quagmire | C developer | 29 |


In [11]:
# describe() summarises the numeric columns (count, mean, std, quartiles), which is a great way to get an overall view of your data
employees.describe() 

| | age |
|---|---|
| count | 4.0 |
| mean | 27.75 |
| std | 2.753785 |
| min | 25.0 |
| 25% | 25.75 |
| 50% | 27.5 |
| 75% | 29.5 |
| max | 31.0 |


In [13]:
# value_counts() counts how many times each value occurs in a column (it works for text columns as well as numeric ones)

employees.age.value_counts()

31    1
29    1
26    1
25    1
Name: age, dtype: int64
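Every age above is unique, so the counts are not very telling. As a small sketch with a made-up `jobs` series that contains a repeated value, the code below shows `value_counts()` on text data, plus the `normalize=True` option that returns proportions instead of raw counts.

```python
import pandas as pd

# Made-up data for illustration: one job title appears twice.
jobs = pd.Series(['Python developer', 'Java developer',
                  'Python developer', 'C developer'])

# Counts per distinct value, most frequent first.
counts = jobs.value_counts()

# normalize=True gives each value's share of the total instead.
shares = jobs.value_counts(normalize=True)

print(counts)
print(shares)
```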

In [14]:
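# We can also access a column in two ways; use whichever you prefer
# First, dot notation (this only works when the column name is a valid Python name
# that doesn't clash with an existing dataframe attribute)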
employees.job

0      Data scientist
1    Python developer
2      Java developer
3         C developer
Name: job, dtype: object

In [15]:
# Second, a pair of square brackets (this works for any column name)
employees['job']

0      Data scientist
1    Python developer
2      Java developer
3         C developer
Name: job, dtype: object
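The start of this section mentions keeping only the employees over a certain age. One common way to do that, sketched here with the same `employees` data and an arbitrary cut-off of 26, is to build a boolean condition on a column and pass it inside square brackets.

```python
import pandas as pd

employees = pd.DataFrame({
    'name': ['Joe', 'Peter', 'Chris', 'Quagmire'],
    'job': ['Data scientist', 'Python developer', 'Java developer', 'C developer'],
    'age': [26, 25, 31, 29],
})

# The comparison yields a True/False series, one value per row.
is_older = employees['age'] > 26

# Passing that series in brackets keeps only the rows marked True.
older = employees[is_older]

print(older)
```

The rows that survive keep their original index labels, which is why the result is labelled 2 and 3 rather than 0 and 1.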

### iloc and loc

These two indexers are among the most essential tools in pandas. They are used with square brackets rather than called like methods, and they let us pull out exactly the portion of the data we want

- `loc` is label based: we give row labels first and column labels second, so columns are selected by name. Label slices include the end label

- `iloc` is position based: we give integer positions for the rows and columns, and slices exclude the end position as in regular Python

Examples below

In [16]:
employees.loc[:,'age'] # retrieving only the age column FOR ALL ROWS hence the colon 

0    26
1    25
2    31
3    29
Name: age, dtype: int64

In [19]:
employees.loc[:2,'age'] # rows labelled 0 through 2 of the age column; loc includes the end label, so we get three entries

0    26
1    25
2    31
Name: age, dtype: int64

In [20]:
employees.loc[:,['age','job']] # passing a list of columns to be retrieved

| | age | job |
|---|---|---|
| 0 | 26 | Data scientist |
| 1 | 25 | Python developer |
| 2 | 31 | Java developer |
| 3 | 29 | C developer |


In [23]:
employees.loc[:2,['age','job']] # rows labelled 0 through 2 (three rows) of the selected columns

| | age | job |
|---|---|---|
| 0 | 26 | Data scientist |
| 1 | 25 | Python developer |
| 2 | 31 | Java developer |


In [24]:
employees.iloc[:,0] # using iloc to retrieve all rows of the column at position 0

0         Joe
1       Peter
2       Chris
3    Quagmire
Name: name, dtype: object

In [26]:
employees.iloc[:,[0,1]] # passing a list of column positions

| | name | job |
|---|---|---|
| 0 | Joe | Data scientist |
| 1 | Peter | Python developer |
| 2 | Chris | Java developer |
| 3 | Quagmire | C developer |
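One detail that is easy to trip over is how the two indexers treat the end of a slice. The sketch below runs the same `:2` slice through both on the `employees` data; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

employees = pd.DataFrame({
    'name': ['Joe', 'Peter', 'Chris', 'Quagmire'],
    'job': ['Data scientist', 'Python developer', 'Java developer', 'C developer'],
    'age': [26, 25, 31, 29],
})

# loc slices by label and INCLUDES the end label: rows 0, 1 and 2.
by_label = employees.loc[:2, 'age']

# iloc slices by position and EXCLUDES the end position: rows 0 and 1.
# The age column sits at position 2.
by_position = employees.iloc[:2, 2]

print(len(by_label), len(by_position))
```

With the default numeric index the labels and positions happen to coincide, which is what makes the difference in slice length surprising at first.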


### Creating your first series

A series is basically a single column of data, and a dataframe is made up of a group of series that share the same index

***How to create a series***

In [27]:
temp = pd.Series([21,25,26,24,29,31])
temp

0    21
1    25
2    26
3    24
4    29
5    31
dtype: int64

In [28]:
# we can also give the series a custom index by passing a list of labels

temp = pd.Series([21,25,26,24,29,31],index=['sunday','monday','tuesday','wednesday','thursday','friday'])
temp

sunday       21
monday       25
tuesday      26
wednesday    24
thursday     29
friday       31
dtype: int64
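A labelled series can then be queried much like a dataframe column. As a brief sketch using the same `temp` values, the code below looks up one label, filters by a condition, and computes the mean; the threshold of 25 is arbitrary.

```python
import pandas as pd

temp = pd.Series(
    [21, 25, 26, 24, 29, 31],
    index=['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday'],
)

# Look up a single value by its label.
monday = temp['monday']

# Keep only the entries above 25.
warm_days = temp[temp > 25]

# Aggregations such as mean() work on the whole series.
average = temp.mean()

print(monday, average)
print(warm_days)
```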

We'll focus mostly on dataframes, and in the next cell we'll import our first real-life dataset, so get excited

### Real life dataset

Real data science projects run their analysis on real-life datasets. For this task we're going to use a dataset I found in [this repo](https://github.com/fivethirtyeight/data). It records the swear words and deaths in Quentin Tarantino's movies, along with the minute at which each one occurs, and we'll run some more advanced operations on it

In [3]:
films = pd.read_csv(r"data/tarantino.csv") # path to a local copy of the csv file
films.head()

| | movie | type | word | minutes_in |
|---|---|---|---|---|
| 0 | Reservoir Dogs | word | dick | 0.4 |
| 1 | Reservoir Dogs | word | dicks | 0.43 |
| 2 | Reservoir Dogs | word | fucked | 0.55 |
| 3 | Reservoir Dogs | word | fucking | 0.61 |
| 4 | Reservoir Dogs | word | bullshit | 0.61 |
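If you don't have the file on disk yet, you can still experiment with `read_csv()`, since it also accepts file-like objects. The sketch below feeds it a small in-memory CSV through `io.StringIO`; the rows are made up for illustration and only mimic the column layout of the real dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text that only mimics the real file's columns.
csv_text = """movie,type,word,minutes_in
Movie A,word,damn,0.40
Movie A,death,,10.50
"""

# read_csv parses the header row into column names and infers the types.
sample = pd.read_csv(io.StringIO(csv_text))

# The empty `word` field on the death row is read as a missing value (NaN).
missing_words = sample['word'].isna().sum()

print(sample)
print(missing_words)
```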


### Advanced selection and sorting

The first thing we'll learn is how to make more precise selections from our data. Let's say we want a list of the distinct movies included in the dataset. To get it we use the `unique()` method, as in the example below

In [4]:
films.movie.unique()

array(['Reservoir Dogs', 'Pulp Fiction', 'Kill Bill: Vol. 1',
       'Kill Bill: Vol. 2', 'Inglorious Basterds', 'Django Unchained',
       'Jackie Brown'], dtype=object)

The following examples show how to filter rows with a condition, and how to chain several conditions with `&` (each condition wrapped in its own parentheses)

In [5]:
films.loc[(films.movie=='Django Unchained') & (films.type == 'word')]

| | movie | type | word | minutes_in |
|---|---|---|---|---|
| 1213 | Django Unchained | word | goddamn | 7.23 |
| 1215 | Django Unchained | word | goddamn | 8.37 |
| 1216 | Django Unchained | word | bitch | 8.38 |
| 1217 | Django Unchained | word | damn | 8.65 |
| 1218 | Django Unchained | word | fucking | 8.87 |
| ... | ... | ... | ... | ... |
| 1516 | Django Unchained | word | n-word | 159.80 |
| 1517 | Django Unchained | word | fucked | 159.83 |
| 1518 | Django Unchained | word | n-word | 159.88 |
| 1519 | Django Unchained | word | n-word | 160.20 |
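Conditions can be combined in other ways too. The sketch below uses a tiny made-up `events` dataframe (not the real dataset) to show `|` for "or", `~` for "not", and `isin()` for matching against a list of values.

```python
import pandas as pd

# Made-up rows for illustration; the columns mirror the real dataset.
events = pd.DataFrame({
    'movie': ['Movie A', 'Movie A', 'Movie B', 'Movie C'],
    'type': ['word', 'death', 'word', 'death'],
    'minutes_in': [1.5, 20.0, 35.2, 90.1],
})

# "or": rows from Movie A, or any death.
either = events.loc[(events['movie'] == 'Movie A') | (events['type'] == 'death')]

# "not": every row whose type is not 'word'.
not_words = events.loc[~(events['type'] == 'word')]

# isin: rows whose movie is one of several values.
picked = events.loc[events['movie'].isin(['Movie A', 'Movie C'])]

print(either)
```

Python's `and`, `or` and `not` keywords don't work element by element on a series, which is why the `&`, `|` and `~` operators are used here.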


In [36]:
films.minutes_in.sort_values(ascending = False) # SORTING A SERIES

1521    160.45
1520    160.28
1519    160.20
1518    159.88
1517    159.83
         ...  
3         0.61
2         0.55
431       0.52
1         0.43
0         0.40
Name: minutes_in, Length: 1894, dtype: float64
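Whole dataframes can be sorted as well, by one or several columns. A minimal sketch, again with a made-up `events` dataframe, sorts by movie name ascending and then by `minutes_in` descending within each movie.

```python
import pandas as pd

# Made-up rows for illustration.
events = pd.DataFrame({
    'movie': ['Movie B', 'Movie A', 'Movie A', 'Movie B'],
    'minutes_in': [5.0, 12.0, 40.0, 1.0],
})

# One ascending flag per sort column: movie A-Z, then latest minute first.
ordered = events.sort_values(['movie', 'minutes_in'], ascending=[True, False])

print(ordered)
```

`sort_values()` returns a new dataframe and leaves `events` in its original order.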

### apply and map

We use `map()` on a series to run a function on each of its values, as in the example below

In [40]:
films.type.map(lambda x:x == 'word')

0        True
1        True
2        True
3        True
4        True
        ...  
1889     True
1890     True
1891     True
1892     True
1893    False
Name: type, Length: 1894, dtype: bool
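Besides a lambda, `map()` also accepts a dictionary that translates each value, or any existing function. The sketch below uses a short made-up `kinds` series; the replacement labels are arbitrary.

```python
import pandas as pd

# Made-up values mirroring the `type` column.
kinds = pd.Series(['word', 'death', 'word'])

# A dictionary maps each old value to a new one.
labels = kinds.map({'word': 'swear word', 'death': 'on-screen death'})

# Any one-argument function works too, here the built-in len.
lengths = kinds.map(len)

print(labels)
print(lengths)
```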

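We use `apply()` to run a function across a dataframe. With `axis='columns'` the function receives one row at a time

In [45]:
def new_value(row):
    row = row.copy()  # work on a copy rather than mutating the row pandas passes in
    row['type'] = 'some_value'
    return row

films.apply(new_value,axis = 'columns') # returns a new dataframe with every 'type' value replaced by the value we specified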

| | movie | type | word | minutes_in |
|---|---|---|---|---|
| 0 | Reservoir Dogs | some_value | dick | 0.40 |
| 1 | Reservoir Dogs | some_value | dicks | 0.43 |
| 2 | Reservoir Dogs | some_value | fucked | 0.55 |
| 3 | Reservoir Dogs | some_value | fucking | 0.61 |
| 4 | Reservoir Dogs | some_value | bullshit | 0.61 |
| ... | ... | ... | ... | ... |
| 1889 | Jackie Brown | some_value | motherfucker | 141.93 |
| 1890 | Jackie Brown | some_value | ass | 142.43 |
| 1891 | Jackie Brown | some_value | fucking | 142.47 |
| 1892 | Jackie Brown | some_value | goddamn | 142.97 |


In [46]:
def when_did_it_happen(row):
    row = row.copy()  # work on a copy rather than mutating the row pandas passes in
    if 0 <= row['minutes_in'] <= 50:
        row['minutes_in'] = 'early'
    elif 50 < row['minutes_in'] <= 100:  # strictly after minute 50, up to minute 100
        row['minutes_in'] = 'halfway through'
    else:
        row['minutes_in'] = 'later'
    return row

films.apply(when_did_it_happen,axis='columns')

| | movie | type | word | minutes_in |
|---|---|---|---|---|
| 0 | Reservoir Dogs | word | dick | early |
| 1 | Reservoir Dogs | word | dicks | early |
| 2 | Reservoir Dogs | word | fucked | early |
| 3 | Reservoir Dogs | word | fucking | early |
| 4 | Reservoir Dogs | word | bullshit | early |
| ... | ... | ... | ... | ... |
| 1889 | Jackie Brown | word | motherfucker | later |
| 1890 | Jackie Brown | word | ass | later |
| 1891 | Jackie Brown | word | fucking | later |
| 1892 | Jackie Brown | word | goddamn | later |
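Row-by-row `apply()` is flexible but can be slow on large dataframes. For this kind of bucketing, one alternative (not the approach used above) is `pd.cut()`, which assigns every value of a series to a labelled range in one call. The sketch below assumes the intended ranges are 0 to 50, above 50 up to 100, and above 100, and uses a few made-up minute values.

```python
import pandas as pd

# Made-up minute values, including the boundary cases 50 and 100.
minutes = pd.Series([0.4, 50.0, 75.3, 100.0, 141.9])

# Bin edges are right-inclusive by default: (0, 50], (50, 100], (100, inf].
# include_lowest=True makes the first bin include 0 itself.
stage = pd.cut(
    minutes,
    bins=[0, 50, 100, float('inf')],
    labels=['early', 'halfway through', 'later'],
    include_lowest=True,
)

print(stage)
```

The result is a categorical series, which could be stored as a new column instead of overwriting `minutes_in`.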


### Grouping data

Pandas allows us to group rows together so we can extract useful info about each group. For example, we'll group by movie and type, then call `agg()` on the `minutes_in` column. `agg()` accepts a list of functions and applies each one to every group, so here it tells us when each type of event first and last happened in each movie

In [47]:
films.groupby(['movie','type']).minutes_in.agg(['min','max'])

| movie | type | min | max |
|---|---|---|---|
| Django Unchained | death | 7.75 | 160.45 |
| Django Unchained | word | 7.23 | 160.28 |
| Inglorious Basterds | death | 20.35 | 147.37 |
| Inglorious Basterds | word | 22.2 | 148.73 |
| Jackie Brown | death | 23.08 | 143.13 |
| Jackie Brown | word | 4.38 | 142.97 |
| Kill Bill: Vol. 1 | death | 13.53 | 97.6 |
| Kill Bill: Vol. 1 | word | 7.12 | 100.1 |
| Kill Bill: Vol. 2 | death | 14.38 | 121.12 |
| Kill Bill: Vol. 2 | word | 1.18 | 119.58 |
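To see how the grouped result is put together, here is a small sketch on a made-up `events` dataframe. It adds `'count'` to the list of aggregations, reads one group back with a `(movie, type)` tuple, and uses `reset_index()` to turn the group labels back into regular columns.

```python
import pandas as pd

# Made-up rows for illustration; the columns mirror the real dataset.
events = pd.DataFrame({
    'movie': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'type': ['word', 'word', 'death', 'word', 'death'],
    'minutes_in': [1.0, 30.0, 12.0, 5.0, 60.0],
})

# One row per (movie, type) pair, one column per aggregation.
summary = events.groupby(['movie', 'type'])['minutes_in'].agg(['min', 'max', 'count'])

# The grouped result has a two-level index, so a group is addressed by a tuple.
last_word_in_a = summary.loc[('Movie A', 'word'), 'max']

# reset_index moves movie and type back into ordinary columns.
flat = summary.reset_index()

print(summary)
```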
