# pandas

    The name pandas is derived from "panel data". It is a library built on top of the NumPy library.
    
    We have 3 kinds of data structures inside pandas.
    
    1. Index
    2. Series
    3. DataFrame
    
    Pandas is commonly used to clean, explore, and prepare tabular data, including for predictive analysis.
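
To make the relationship between the three structures concrete, here is a minimal sketch. The column names, row labels, and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# A small DataFrame with hypothetical values, used only for illustration.
df = pd.DataFrame(
    {"runs": [13, 24, 56], "balls": [10, 20, 30]},
    index=["match_1", "match_2", "match_3"],
)

runs_column = df["runs"]   # selecting one column returns a Series
row_labels = df.index      # the row labels are stored in an Index

print(type(df).__name__)           # DataFrame
print(type(runs_column).__name__)  # Series
print(type(row_labels).__name__)   # Index
```

The selected column shares its row labels with the DataFrame it came from.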

## `importing pandas`

In [1]:
import numpy as np
import pandas as pd

## `creating series`

- We can create pandas series using lists or dictionaries.

In [2]:
country = ['India', 'Pakistan', 'USA', 'Nepal', 'Srilanka']

pd.Series(country)

0       India
1    Pakistan
2         USA
3       Nepal
4    Srilanka
dtype: object

    A series can also hold integer values.

In [3]:
runs = [13, 24, 56, 78, 100]
pd.Series(runs)

0     13
1     24
2     56
3     78
4    100
dtype: int64

- We can also give a custom index to our series object.

In [4]:
marks = [67, 57, 89, 100]
subjects = ['Maths', 'Science', 'English', 'Hindi']

pd.Series(marks, index=subjects)

Maths       67
Science     57
English     89
Hindi      100
dtype: int64

- We may also provide a `name` for our series.

In [5]:
marks = [67, 57, 89, 100]
subjects = ['Maths', 'Science', 'English', 'Hindi']

pd.Series(marks, index = subjects, name = 'Daniyaal marks')

Maths       67
Science     57
English     89
Hindi      100
Name: Daniyaal marks, dtype: int64

## `creating series from dictionaries`

    We can also create a series object by passing a dictionary to the pd.Series() constructor. The keys become the
    index and the values become the data.

In [6]:
marks_dictionary = {subject:score for subject, score in zip(subjects, marks)}
marks_dictionary

{'Maths': 67, 'Science': 57, 'English': 89, 'Hindi': 100}

In [16]:
marks_ser = pd.Series(marks_dictionary)
marks_ser

Maths       67
Science     57
English     89
Hindi      100
dtype: int64
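
A dictionary can also be combined with the `index` parameter. The sketch below assumes a label, `'Urdu'`, that is not a key in the dictionary; it is hypothetical and only there to show what happens to a missing label.

```python
import pandas as pd

marks_dictionary = {'Maths': 67, 'Science': 57, 'English': 89, 'Hindi': 100}

# 'Urdu' is a hypothetical label that is not a key in the dictionary.
wanted = ['Hindi', 'Maths', 'Urdu']
subset = pd.Series(marks_dictionary, index=wanted)

print(subset)
```

Only the requested labels are kept, in the requested order. A label with no matching key gets `NaN`, which also turns the dtype into `float64`.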

## `attributes of series`

    Since series are Python objects, each series instance has certain attributes associated with it.

In [17]:
marks_ser

Maths       67
Science     57
English     89
Hindi      100
dtype: int64

In [18]:
marks_ser.size

4

In [19]:
marks_ser.dtype

dtype('int64')

In [20]:
marks_ser.shape

(4,)

In [21]:
marks_ser.index

Index(['Maths', 'Science', 'English', 'Hindi'], dtype='object')

In [22]:
marks_ser.values

array([ 67,  57,  89, 100], dtype=int64)

In [24]:
marks_ser.is_unique # Tells if all the values are unique or not

True

In [27]:
pd.Series([1, 2, 3, 4, np.nan, 5, 6, np.nan]).is_unique

False

In [25]:
marks_ser.hasnans # Tells if the series has any missing values or not

False

In [26]:
pd.Series([1, 2, 3, 4, np.nan, 5, 6, np.nan]).hasnans

True
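
The `False` from `is_unique` above comes from `NaN` appearing twice, not from a repeated number. A short sketch of that behaviour (the variable name `with_gaps` is an assumption):

```python
import numpy as np
import pandas as pd

with_gaps = pd.Series([1, 2, 3, 4, np.nan, 5, 6, np.nan])

print(with_gaps.is_unique)           # False: NaN appears twice
print(with_gaps.dropna().is_unique)  # True: the remaining values are all different
```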

## `importing series from read_csv`

    The read_csv() function always returns a DataFrame. However, if the csv file has only 1 column, or 2 columns where
    one of them is used as the index, the resulting DataFrame has a single column and we can squeeze it into a
    series object.

In [29]:
pd.read_csv('./content/subs.csv')

     Subscribers gained
0                    48
1                    57
2                    40
3                    43
4                    44
..                  ...
360                 231
361                 226
362                 155
363                 144


In [31]:
subs = pd.read_csv('./content/subs.csv').squeeze('columns')  # a one-column DataFrame becomes a series
subs

0       48
1       57
2       40
3       43
4       44
      ... 
360    231
361    226
362    155
363    144
364    172
Name: Subscribers gained, Length: 365, dtype: int64
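
The same idea can be tried without a file on disk. This is a minimal sketch that assumes a small in-memory CSV (the text in `csv_text` is made up) in place of `subs.csv`.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a one-column file on disk.
csv_text = "Subscribers gained\n48\n57\n40\n"

frame = pd.read_csv(io.StringIO(csv_text))  # read_csv returns a DataFrame
series = frame.squeeze("columns")           # a single column collapses to a Series

print(type(frame).__name__, frame.shape)    # DataFrame (3, 1)
print(type(series).__name__, series.shape)  # Series (3,)
```

The column header becomes the `name` of the resulting series.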

    When the csv file has more than 1 column, we also need to tell pandas which column to use as the index, so that
    only a single data column remains. We can do this by passing either the name of that column or its position
    to the index_col parameter.

In [33]:
pd.read_csv('./content/kohli_ipl.csv', index_col = 0)

          runs
match_no
1            1
2           23
3           13
4           12
5            1
...        ...
211          0
212         20
213         73
214         25


In [42]:
virat_kohli_scores = pd.read_csv('./content/kohli_ipl.csv', index_col = 0).squeeze('columns')
virat_kohli_scores

match_no
1       1
2      23
3      13
4      12
5       1
       ..
211     0
212    20
213    73
214    25
215     7
Name: runs, Length: 215, dtype: int64
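
To check that the column position and the column name are interchangeable in `index_col`, here is a small sketch. It assumes a made-up in-memory CSV with the same two headers as `kohli_ipl.csv`.

```python
import io

import pandas as pd

# Hypothetical two-column CSV; only the headers mirror the real file.
csv_text = "match_no,runs\n1,1\n2,23\n3,13\n"

by_position = pd.read_csv(io.StringIO(csv_text), index_col=0).squeeze("columns")
by_name = pd.read_csv(io.StringIO(csv_text), index_col="match_no").squeeze("columns")

# Both calls produce the same series, indexed by match_no.
print(by_position.equals(by_name))
print(by_name.index.name, by_name.name)
```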

In [45]:
movies = pd.read_csv('./content/bollywood.csv', index_col = 'movie').squeeze('columns')
movies

movie
Uri: The Surgical Strike                   Vicky Kaushal
Battalion 609                                Vicky Ahuja
The Accidental Prime Minister (film)         Anupam Kher
Why Cheat India                            Emraan Hashmi
Evening Shadows                         Mona Ambegaonkar
                                              ...       
Hum Tumhare Hain Sanam                    Shah Rukh Khan
Aankhen (2002 film)                     Amitabh Bachchan
Saathiya (film)                             Vivek Oberoi
Company (film)                                Ajay Devgn
Awara Paagal Deewana                        Akshay Kumar
Name: lead, Length: 1500, dtype: object

    We can remove the index name (movie) shown at the top of the output by setting movies.index.name to None.

In [46]:
movies.index.name = None

In [47]:
movies

Uri: The Surgical Strike                   Vicky Kaushal
Battalion 609                                Vicky Ahuja
The Accidental Prime Minister (film)         Anupam Kher
Why Cheat India                            Emraan Hashmi
Evening Shadows                         Mona Ambegaonkar
                                              ...       
Hum Tumhare Hain Sanam                    Shah Rukh Khan
Aankhen (2002 film)                     Amitabh Bachchan
Saathiya (film)                             Vivek Oberoi
Company (film)                                Ajay Devgn
Awara Paagal Deewana                        Akshay Kumar
Name: lead, Length: 1500, dtype: object
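
Assigning to `index.name` changes the series in place. If the original should stay untouched, `rename_axis(None)` is one alternative that returns a new series instead. The sketch below uses a tiny stand-in series, `movies_demo`, which is an assumption and not the real dataset.

```python
import pandas as pd

# Two-row stand-in for the movies series, built by hand for illustration.
movies_demo = pd.Series(
    ["Vicky Kaushal", "Anupam Kher"],
    index=pd.Index(
        ["Uri: The Surgical Strike", "The Accidental Prime Minister (film)"],
        name="movie",
    ),
    name="lead",
)

unnamed = movies_demo.rename_axis(None)  # new series without an index name

print(movies_demo.index.name)  # movie
print(unnamed.index.name)      # None
```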

## `series methods`

    Since series are objects, each series also has methods that we can call on it.

### `count()`
- The `count()` method returns the number of non-missing values in a series. This is different from the `size` attribute.
- The `size` attribute gives the total number of values, including missing ones.

In [48]:
exp = pd.Series([1, 2, 3, 4, np.nan, 6, 7, np.nan, 9, np.nan])
exp

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    6.0
6    7.0
7    NaN
8    9.0
9    NaN
dtype: float64

In [49]:
exp.size, exp.count()

(10, 7)
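
The gap between `size` and `count()` is exactly the number of missing values. A quick sketch of that relationship, counting the missing values with `isna().sum()`:

```python
import numpy as np
import pandas as pd

exp = pd.Series([1, 2, 3, 4, np.nan, 6, 7, np.nan, 9, np.nan])

missing = exp.isna().sum()  # number of NaN entries

print(exp.size, exp.count(), missing)  # 10 7 3
```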

### `value_counts()`

- The `value_counts()` method returns the frequency of each distinct value as a series, sorted from most to least frequent.
- By default, `value_counts()` does not include the count of `NaN`.
- If we want to include the count of `NaN`, then we need to pass `dropna = False`.

In [50]:
movies

Uri: The Surgical Strike                   Vicky Kaushal
Battalion 609                                Vicky Ahuja
The Accidental Prime Minister (film)         Anupam Kher
Why Cheat India                            Emraan Hashmi
Evening Shadows                         Mona Ambegaonkar
                                              ...       
Hum Tumhare Hain Sanam                    Shah Rukh Khan
Aankhen (2002 film)                     Amitabh Bachchan
Saathiya (film)                             Vivek Oberoi
Company (film)                                Ajay Devgn
Awara Paagal Deewana                        Akshay Kumar
Name: lead, Length: 1500, dtype: object

In [51]:
movies.value_counts()

lead
Akshay Kumar        48
Amitabh Bachchan    45
Ajay Devgn          38
Salman Khan         31
Sanjay Dutt         26
                    ..
Diganth              1
Parveen Kaur         1
Seema Azmi           1
Akanksha Puri        1
Edwin Fernandes      1
Name: count, Length: 566, dtype: int64

In [54]:
names = pd.Series(['Abhishek', 'Abhishek', 'Amrusha', 'Amrusha', 'Priyanka', 'Daniyaal', np.nan, np.nan,'Sameer'])
names

0    Abhishek
1    Abhishek
2     Amrusha
3     Amrusha
4    Priyanka
5    Daniyaal
6         NaN
7         NaN
8      Sameer
dtype: object

In [55]:
names.value_counts()

Abhishek    2
Amrusha     2
Priyanka    1
Daniyaal    1
Sameer      1
Name: count, dtype: int64

    As we can see, the count of missing values is not included by default. However, we can include it by
    setting dropna = False.

In [56]:
names.value_counts(dropna = False)

Abhishek    2
Amrusha     2
NaN         2
Priyanka    1
Daniyaal    1
Sameer      1
Name: count, dtype: int64
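
`value_counts()` can also report proportions instead of raw counts through its `normalize` parameter. This is a minimal sketch on the same names; the variable names `shares` and `shares_with_nan` are assumptions.

```python
import numpy as np
import pandas as pd

names = pd.Series(['Abhishek', 'Abhishek', 'Amrusha', 'Amrusha', 'Priyanka',
                   'Daniyaal', np.nan, np.nan, 'Sameer'])

# Fractions of the 7 non-missing values.
shares = names.value_counts(normalize=True)

# Fractions of all 9 values, with NaN counted as its own entry.
shares_with_nan = names.value_counts(normalize=True, dropna=False)

print(shares)
print(shares_with_nan)
```

Each result sums to 1, so the choice of `dropna` changes the denominator as well as the rows shown.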