# <center>Introduction to Pandas</center>

![](https://pandas.pydata.org/_static/pandas_logo.png)


## Installation

Simply,
```
pip install pandas
```


## Reading data from a CSV file

You can read data from a CSV file using the ``read_csv`` function. By default, it assumes that the fields are comma-separated.
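
When the fields are separated by something other than a comma, `read_csv` accepts a `sep` argument. The following is a minimal sketch that reads from an in-memory string instead of a file; the data and the column names `name` and `score` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data standing in for a file on disk
raw = "name;score\nalice;90\nbob;85\n"

# sep tells read_csv which character separates the fields
scores = pd.read_csv(io.StringIO(raw), sep=";")
print(scores)
```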

In [174]:
import pandas as pd

>The `imdb_1000.csv` dataset contains the highest-rated IMDb "Top 1000" titles.

In [129]:
# load the IMDb dataset as a pandas DataFrame (df1, called imdb_df in the comments below)
df1 = pd.read_csv('imdb_1000.csv')

In [130]:
# show first 5 rows of imdb_df
print(df1.head())

   star_rating                     title content_rating   genre  duration  \
0          9.3  The Shawshank Redemption              R   Crime       142   
1          9.2             The Godfather              R   Crime       175   
2          9.1    The Godfather: Part II              R   Crime       200   
3          9.0           The Dark Knight          PG-13  Action       152   
4          8.9              Pulp Fiction              R   Crime       154   

                                         actors_list  
0  [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...  
1    [u'Marlon Brando', u'Al Pacino', u'James Caan']  
2  [u'Al Pacino', u'Robert De Niro', u'Robert Duv...  
3  [u'Christian Bale', u'Heath Ledger', u'Aaron E...  
4  [u'John Travolta', u'Uma Thurman', u'Samuel L....  


>The `bikes.csv` dataset contains information about the number of bicycles that used certain bicycle lanes in Montreal in the year 2012.

In [131]:
# load the bikes dataset as a pandas DataFrame (df2, called bikes_df in the comments below)
df2 = pd.read_csv('bikes.csv')

In [132]:
# show first 3 rows of bikes_df
print(df2.head(3))

         Date Unnamed: 1  Rachel / Papineau  Berri1  Maisonneuve_2  \
0  01/01/2012      00:00                 16      35             51   
1  02/01/2012      00:00                 43      83            153   
2  03/01/2012      00:00                 58     135            248   

   Maisonneuve_1  Brébeuf  Parc  PierDup  CSC (Côte Sainte-Catherine)  \
0             38      5.0    26       10                            0   
1             68     11.0    53        6                            1   
2            104      2.0    89        3                            2   

   Pont_Jacques_Cartier  
0                  27.0  
1                  21.0  
2                  15.0  


## Selecting columns

When you read a CSV, you get a kind of object called a DataFrame, which is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary.
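
As a small illustration of the dictionary-style access described above, here is a sketch on a made-up DataFrame (the name `films` and its values are hypothetical). A single key returns a Series, while a list of keys returns a DataFrame.

```python
import pandas as pd

# Small made-up DataFrame for illustration
films = pd.DataFrame({"title": ["A", "B"], "duration": [100, 120]})

# A single key, as with a dictionary, returns one column as a Series
durations = films["duration"]

# A list of keys returns a DataFrame containing just those columns
subset = films[["title", "duration"]]

print(type(durations).__name__)
print(type(subset).__name__)
```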

In [133]:
# list columns of imdb_df
print(df1.columns)

Index(['star_rating', 'title', 'content_rating', 'genre', 'duration',
       'actors_list'],
      dtype='object')


In [134]:
# what are the datatypes of values in columns
print(df1.dtypes)

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object


In [135]:
# list first 5 movie titles
print(df1['title'].head())

0    The Shawshank Redemption
1               The Godfather
2      The Godfather: Part II
3             The Dark Knight
4                Pulp Fiction
Name: title, dtype: object


In [136]:
# show only movie title and genre
print(df1[['title','genre']])

                                               title      genre
0                           The Shawshank Redemption      Crime
1                                      The Godfather      Crime
2                             The Godfather: Part II      Crime
3                                    The Dark Knight     Action
4                                       Pulp Fiction      Crime
..                                               ...        ...
974                                          Tootsie     Comedy
975                      Back to the Future Part III  Adventure
976  Master and Commander: The Far Side of the World     Action
977                                      Poltergeist     Horror
978                                      Wall Street      Crime

[979 rows x 2 columns]


## Understanding columns

Each column of a DataFrame is a ``pd.Series``, and a Series is typically backed by a NumPy array. If you add ``.values`` to the end of a Series, you get its underlying **NumPy array**.
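
A quick sketch of this on a made-up Series (the values are for illustration only). It also shows `to_numpy()`, which the pandas documentation recommends over `.values` for getting the array.

```python
import numpy as np
import pandas as pd

# Made-up durations for illustration
durations = pd.Series([142, 175, 200], name="duration")

arr = durations.values       # underlying NumPy array
arr2 = durations.to_numpy()  # same data through the recommended accessor

print(type(arr))
print(arr.mean())
```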

In [137]:
# show the data type (dtype) of the duration column
print(df1['duration'].dtype)

int64


In [138]:
# show duration values of movies as a numpy array
print(df1['duration'].values)

[142 175 200 152 154  96 161 201 195 139 178 148 124 142 179 169 133 207
 146 121 136 130 130 106 127 116 175 118 110  87 125 112 102 107 119  87
 169 115 112 109 189 110 150 165 155 137 113 165  95 151 155 153 125 130
 116  89 137 117  88 165 170  89 146  99  98 116 156 122 149 134 122 136
 157 123 119 137 128 120 229 107 134 103 177 129 102 216 136  93  68 189
  99 108 113 181 103 138 110 129  88 160 126  91 116 125 143  93 102 132
 153 183 160 120 138 140 153 170 129  81 127 131 172 115 108 107 129 156
  96  91  95 162 130  86 186 151  96 170 118 161 131 126 131 129 224 180
 105 117 140 119 124 130 139 107 132 117 126 122 178 238 149 172  98 116
 116 123 148 123 182  92  93 100 135 105  94 140  83  95  98 143  99  98
 121 163 121 167 188 121 109 110 129 127  94 107 100 117 129 120 121 133
 111 122 101 134 165 138 212 154  89 134  93 114  88 130 101 158  99 108
 124 132 113 131 191 167 130 147 102  88 165 132 118 101 108 174  98  92
  98 106  85 101 105 115 115 124 105 103 138 184 12

## Applying functions to columns

Use the `.apply` method to apply a function to each element of a column.

In [139]:
# convert all the movie titles to uppercase
def to_uppercase(s):
    return s.upper() 
df1['title']=df1['title'].apply(to_uppercase)
print(df1['title'].head())

0    THE SHAWSHANK REDEMPTION
1               THE GODFATHER
2      THE GODFATHER: PART II
3             THE DARK KNIGHT
4                PULP FICTION
Name: title, dtype: object
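
`.apply` also accepts a lambda, and for common string operations the `.str` accessor can do the same job without a custom function. A minimal sketch on made-up titles:

```python
import pandas as pd

# Made-up sample of titles
titles = pd.Series(["Pulp Fiction", "The Godfather"])

# apply with a lambda: compute the length of each title
lengths = titles.apply(lambda s: len(s))

# the vectorised .str accessor as an alternative to apply for string methods
upper = titles.str.upper()

print(lengths.tolist())
print(upper.tolist())
```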


## Plotting a column

Use the ``.plot()`` method! By default it draws a line chart.

In [146]:
%matplotlib inline

In [148]:
# plot the bikers travelling to Berri1 over the year
df2['Berri1'].plot()

<Axes: >

In [149]:
# plot all the columns of bikes_df
df2.plot(figsize=(10,7))

<Axes: >

## Value counts

``value_counts()`` returns the number of occurrences of each unique value in a column (Series), sorted from most to least frequent.

In [150]:
# what are the unique genres in imdb_df, and how many movies does each have?
df1['genre'].value_counts()

genre
Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Adventure     75
Animation     62
Horror        29
Mystery       16
Western        9
Sci-Fi         5
Thriller       5
Film-Noir      3
Family         2
History        1
Fantasy        1
Name: count, dtype: int64
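
If proportions are more useful than raw counts, `value_counts` takes a `normalize=True` argument. A small sketch on made-up genres:

```python
import pandas as pd

# Made-up genre column for illustration
genres = pd.Series(["Drama", "Comedy", "Drama", "Crime"])

counts = genres.value_counts()                # absolute counts
shares = genres.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```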

In [151]:
# plotting value counts of unique genres as a bar chart
df1['genre'].value_counts().plot.bar()

<Axes: xlabel='genre'>

In [152]:
# plotting value counts of unique genres as a pie chart
df1['genre'].value_counts().plot.pie()

<Axes: xlabel='genre', ylabel='count'>

## Index

### DATAFRAME = COLUMNS + INDEX + 2-D DATA

### SERIES = INDEX + 1-D DATA

The **Index** (the **row labels**) is one of the fundamental data structures of pandas. It can be thought of as an **immutable array** and an **ordered set**.

> Each row is identified by its index value. Labels are usually unique, although pandas does not require it.
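
The "immutable array" and "ordered set" behaviour can be sketched on a small hand-made Index (the labels are hypothetical):

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])

# Array-like: elements can be read by position
first = idx[0]

# Immutable: assigning to an element raises a TypeError
assignment_failed = False
try:
    idx[0] = "z"
except TypeError:
    assignment_failed = True

# Set-like: operations such as intersection are supported
other = pd.Index(["b", "c", "d"])
common = idx.intersection(other)

print(first, assignment_failed, common.tolist())
```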

In [155]:
# show the column labels of bikes_df (these are stored in an Index object too)
print(df2.columns)

Index(['Date', 'Unnamed: 1', 'Rachel / Papineau', 'Berri1', 'Maisonneuve_2',
       'Maisonneuve_1', 'Brébeuf', 'Parc', 'PierDup',
       'CSC (Côte Sainte-Catherine)', 'Pont_Jacques_Cartier'],
      dtype='object')


In [159]:
# use the Date column as the index, then get the row for 01/01/2012 by its label
df2 = df2.set_index('Date')
print(df2.loc['01/01/2012'])

Unnamed: 1                     00:00
Rachel / Papineau                 16
Berri1                            35
Maisonneuve_2                     51
Maisonneuve_1                     38
Brébeuf                          5.0
Parc                              26
PierDup                           10
CSC (Côte Sainte-Catherine)        0
Pont_Jacques_Cartier            27.0
Name: 01/01/2012, dtype: object


#### To get row by integer index:

Use ``.iloc[]`` for purely integer-location based indexing for selection by position.

In [161]:
# show 11th row of bikes_df using iloc
df2.iloc[10]

Unnamed: 1                     00:00
Rachel / Papineau                194
Berri1                           273
Maisonneuve_2                    443
Maisonneuve_1                    182
Brébeuf                          7.0
Parc                             258
PierDup                           12
CSC (Côte Sainte-Catherine)        0
Pont_Jacques_Cartier            20.0
Name: 11/01/2012, dtype: object
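
To make the difference between label-based and position-based lookup concrete, here is a sketch on a made-up frame indexed by date strings (the values are for illustration only):

```python
import pandas as pd

# Made-up counts indexed by date strings
counts = pd.DataFrame(
    {"Berri1": [35, 83, 135]},
    index=["01/01/2012", "02/01/2012", "03/01/2012"],
)

by_label = counts.loc["02/01/2012"]  # lookup by index label
by_position = counts.iloc[1]         # lookup by 0-based integer position

# Both select the same row here
print(by_label.equals(by_position))
```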

## Selecting rows where column has a particular value

In [162]:
# select only those movies where genre is adventure
df1[df1['genre']=='Adventure']

| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 7 | 8.9 | THE LORD OF THE RINGS: THE RETURN OF THE KING | PG-13 | Adventure | 201 | [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... |
| 10 | 8.8 | THE LORD OF THE RINGS: THE FELLOWSHIP OF THE RING | PG-13 | Adventure | 178 | [u'Elijah Wood', u'Ian McKellen', u'Orlando Bl... |
| 14 | 8.8 | THE LORD OF THE RINGS: THE TWO TOWERS | PG-13 | Adventure | 179 | [u'Elijah Wood', u'Ian McKellen', u'Viggo Mort... |
| 15 | 8.7 | INTERSTELLAR | PG-13 | Adventure | 169 | [u'Matthew McConaughey', u'Anne Hathaway', u'J... |
| 54 | 8.5 | BACK TO THE FUTURE | PG | Adventure | 116 | [u'Michael J. Fox', u'Christopher Lloyd', u'Le... |
| ... | ... | ... | ... | ... | ... | ... |
| 936 | 7.4 | TRUE GRIT | | Adventure | 128 | [u'John Wayne', u'Kim Darby', u'Glen Campbell'] |
| 937 | 7.4 | LABYRINTH | PG | Adventure | 101 | [u'David Bowie', u'Jennifer Connelly', u'Toby ... |
| 943 | 7.4 | THE BUCKET LIST | PG-13 | Adventure | 97 | [u'Jack Nicholson', u'Morgan Freeman', u'Sean ... |
| 953 | 7.4 | THE NEVERENDING STORY | PG | Adventure | 102 | [u'Noah Hathaway', u'Barret Oliver', u'Tami St... |


In [166]:
# genres of the movies with star rating above 8 and duration more than 130 minutes
df1[(df1['star_rating']>8) & (df1['duration']>130)]['genre']

0          Crime
1          Crime
2          Crime
3         Action
4          Crime
         ...    
273    Biography
288        Drama
289        Drama
290        Crime
296       Action
Name: genre, Length: 115, dtype: object
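
To go from the filtered genres to the single most frequent one, the boolean mask can be combined with `value_counts()` and `idxmax()`. A minimal sketch on made-up data (the frame `films` and its values are hypothetical, not taken from the IMDb file):

```python
import pandas as pd

# Made-up films for illustration
films = pd.DataFrame({
    "genre": ["Crime", "Crime", "Action", "Drama"],
    "star_rating": [9.3, 9.2, 9.0, 7.9],
    "duration": [142, 175, 152, 140],
})

# Combine two conditions with &; each condition needs its own parentheses
mask = (films["star_rating"] > 8) & (films["duration"] > 130)

# Count the genres among the matching rows and take the most frequent one
top_genre = films.loc[mask, "genre"].value_counts().idxmax()
print(top_genre)
```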

## Adding a new column to DataFrame

In [183]:
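# add a weekday column to bikes_df (Monday=0 ... Sunday=6)
# the index holds 'dd/mm/yyyy' strings, so parse it into dates first
df2['weekday'] = pd.to_datetime(df2.index, format='%d/%m/%Y').weekday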
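
A new column can also be computed from existing columns with ordinary arithmetic. A small sketch, where the frame `films` and the column name `duration_hours` are made up for illustration:

```python
import pandas as pd

# Made-up durations in minutes
films = pd.DataFrame({"title": ["A", "B"], "duration": [90, 150]})

# Assigning to a new key creates the column; the arithmetic is element-wise
films["duration_hours"] = films["duration"] / 60

print(films)
```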

## Deleting an existing column from DataFrame

In [None]:
# remove column 'Unnamed: 1' from bikes_df
df2.drop('Unnamed: 1', axis=1, inplace=True)

## Deleting a row in DataFrame

In [169]:
# remove the first row from bikes_df
df2.drop(df2.index[0],axis=0,inplace=True)
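
Without `inplace=True`, `drop` leaves the original untouched and returns a new DataFrame. The sketch below shows this on a made-up frame, using the `columns=` and `index=` keywords as an alternative to `axis`:

```python
import pandas as pd

# Made-up frame for illustration
frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

no_b = frame.drop(columns="b")               # new frame without column 'b'
no_first = frame.drop(index=frame.index[0])  # new frame without the first row

print(list(no_b.columns))
print(len(no_first))
print(frame.shape)  # the original still has all rows and columns
```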

## Group By

A groupby operation involves one or more of the following steps on the original object:

- Splitting the object into groups

- Applying a function to each group

- Combining the results

In many situations, we split the data into sets and apply some operation to each subset. In the apply step, we can perform the following operations:

- **Aggregation**: computing a summary statistic for each group

- **Transformation**: performing a group-specific operation that returns a result the same size as the input

- **Filtration**: discarding groups that fail some condition
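
The three kinds of operation can be sketched side by side on a tiny made-up frame (the name `films` and its values are hypothetical):

```python
import pandas as pd

# Made-up films: two genres with two movies each
films = pd.DataFrame({
    "genre": ["Crime", "Crime", "Comedy", "Comedy"],
    "duration": [140, 160, 90, 110],
})
grouped = films.groupby("genre")

# Aggregation: one summary value per group
mean_duration = grouped["duration"].mean()

# Transformation: one value per original row, computed within its group
genre_mean = grouped["duration"].transform("mean")

# Filtration: keep only the rows of groups that satisfy a condition
long_genres = grouped.filter(lambda g: g["duration"].mean() > 120)

print(mean_duration)
print(genre_mean.tolist())
print(long_genres)
```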

In [193]:
# group imdb_df by movie genres
df3=df1.groupby('genre')

In [194]:
# get crime movies group
df3.get_group('Crime').head()

| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | THE SHAWSHANK REDEMPTION | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | THE GODFATHER | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | THE GODFATHER: PART II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 4 | 8.9 | PULP FICTION | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
| 21 | 8.7 | CITY OF GOD | R | Crime | 130 | [u'Alexandre Rodrigues', u'Matheus Nachtergael... |


In [197]:
# get the mean movie duration for each genre
# (selecting 'duration' first avoids averaging non-numeric columns such as title)
df3['duration'].aggregate('mean')

In [None]:
# add a column holding the mean duration of each movie's genre
df1['new_duration'] = df3['duration'].transform(lambda x: x.mean())

In [None]:
# drop groups/genres that do not have average movie duration greater than 120.
new_df1=df3.filter(lambda x: x['duration'].mean()>120)

In [190]:
# group bikes_df by weekday
df4 = df2.groupby('weekday')

In [192]:
# get the total biker count for each weekday
df5 = df4.sum(numeric_only=True)

In [None]:
# plot weekday wise biker count for 'Berri1'
df5['Berri1'].plot.bar()

![](https://memegenerator.net/img/instances/500x/73988569/pythonpandas-is-easy-import-and-go.jpg)