<h2 style="text-align: center;">Basics of Data Science Using Python</h2>

In this session, we will learn the basics of data science using Python. In the previous sessions, we discussed the basics of data science and its applications. In today's session, we will cover several Python modules and their applications.

The modules covered in this session are:
<ol>
    <li>Numpy</li>
    <li>Pandas</li>
    <li>Matplotlib</li>
</ol>

<h3 style="text-align: center;"> 2. Learning Pandas </h3>

#### What is Pandas?
<p> pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.</p>

#### What are we going to learn?
<p>We are going to learn the following:</p>
<ol>
    <li>Pandas Object</li>
    <li>Data Indexing and Selection</li>
    <li>Operating on Data</li>
    <li>Handling Missing Data</li>
    <li>Operating on Null Values</li>
    <li>Working With Time Series</li>
</ol>

### Import Pandas 

In [1]:
import pandas as pd
# additional import
import numpy as np

### 1. Pandas Object

#### Constructing Pandas Series Object
<p>A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:</p>

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print('Series:\n',data)

print('Values of the series:',data.values)
print('Index of the series:',data.index)
print('\n')

# Data can also be accessed like in NumPy
print('Selecting single element:',data[1])
print('Selecting multiple element:',data[1:3])

Series:
 0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
Values of the series: [0.25 0.5  0.75 1.  ]
Index of the series: RangeIndex(start=0, stop=4, step=1)


Selecting single element: 0.5
Selecting multiple element: 1    0.50
2    0.75
dtype: float64
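
The index does not have to be the default integer range. As a minimal sketch (the labels and values below are chosen purely for illustration), a Series can also be built from a NumPy array with an explicit index, or directly from a dictionary:

```python
import numpy as np
import pandas as pd

# Series from a NumPy array with an explicit (illustrative) string index
labeled = pd.Series(np.array([0.25, 0.5, 0.75, 1.0]),
                    index=['a', 'b', 'c', 'd'])
print(labeled['b'])               # access by label instead of position

# Series from a dictionary: the keys become the index
from_dict = pd.Series({'x': 10, 'y': 20})
print(from_dict.index.tolist())
```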


#### Constructing DataFrame Object

In [3]:
population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 
                   'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict) 
print('Population Series:\n', population)

Population Series:
 California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64


In [4]:
df = pd.DataFrame(population, columns=['population'])
df

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., “not a number”) values:

In [5]:
pd.DataFrame([
    {'a': 1, 'b': 2}, 
    {'b': 3, 'c': 4}, 
    {'a': 3, 'c': 5, 'd': 7}
])

     a    b    c    d
0  1.0  2.0  NaN  NaN
1  NaN  3.0  4.0  NaN
2  3.0  NaN  5.0  7.0
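
A DataFrame can also be assembled from several Series at once. The sketch below is one possible illustration, reusing the population and area figures for two states from this session's examples; the name `states` is introduced here only for the example:

```python
import pandas as pd

# Figures reused from this session's examples
population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})

# Each dictionary key becomes a column; the Series are aligned on their index
states = pd.DataFrame({'population': population, 'area': area})
print(states)
print(states.columns.tolist())    # column labels
print(states.index.tolist())      # row labels
```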


### 2. Data Indexing and Selection
Here we’ll look at the means of accessing and modifying values in Pandas Series and DataFrame objects. Since we have already learnt the corresponding NumPy patterns, the Pandas patterns will feel very familiar, though there are a few quirks to be aware of.

In [6]:
df = pd.Series([0.25, 0.5, 0.75, 1.0], 
               index=['a', 'b', 'c', 'd'])
print('Data at label b -> ', df['b'])
print('Keys of the series df -> ', df.keys())
print('Items in series df as list -> ', list(df.items()))
# ADDING DATA TO THE SERIES
df['e'] = 12
print('New series df:\n', df, '\n')

# SLICING BY INTEGER POSITION
df2 = df.iloc[1:3]
print('Sliced series:\n', df2, '\n')

# MASKING
df3 = df[(df>0.3) & (df<0.8)]
print('Masked series:\n', df3, '\n')

# FANCY INDEXING (SELECTING SEVERAL LABELS AT ONCE)
df4 = df[['a', 'e']]
print('Fancy indexed series:\n', df4)

Data at label b ->  0.5
Keys of the series df ->  Index(['a', 'b', 'c', 'd'], dtype='object')
Items in series df as list ->  [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
New series df:
 a     0.25
b     0.50
c     0.75
d     1.00
e    12.00
dtype: float64 

Sliced series:
 b    0.50
c    0.75
dtype: float64 

Masked series:
 b    0.50
c    0.75
dtype: float64 

Fancy indexed series:
 a     0.25
e    12.00
dtype: float64
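
One of the quirks concerns integer labels: when the index itself is made of integers, plain square-bracket indexing can be ambiguous between labels and positions. A minimal sketch, using an illustrative Series that is not part of the session data, shows how the `loc` and `iloc` indexers make the intent explicit:

```python
import pandas as pd

# Illustrative Series whose integer labels differ from the positions
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(s.loc[1])      # label-based: the element whose label is 1
print(s.iloc[1])     # position-based: the second element
print(s.loc[1:3])    # label slice, end label included
print(s.iloc[1:3])   # positional slice, end position excluded
```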


### 3. Operating on Data
#### Index Preservation

In [7]:
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), 
                  columns=['A', 'B', 'C', 'D']) 
print('Original df:\n', df, '\n')
df2 = np.sin(df * 5 * np.cos(1))
print('Modified df:\n', df2)

Original df:
    A  B  C  D
0  6  3  7  4
1  6  9  2  6
2  7  4  3  7 

Modified df:
           A         B         C         D
0 -0.480396  0.968775  0.060987 -0.982093
1 -0.480396 -0.730557 -0.770842 -0.480396
2  0.060987 -0.982093  0.968775  0.060987


#### Index Alignment

In [8]:
# INDEX ALIGNMENT IN SERIES
area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967}, name='area')       
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127}, name='population') 

population_density = population / area
print('Population per square kilometer:\n', population_density)

Population per square kilometer:
 Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64


In [9]:
# INDEX ALIGNMENT IN DATAFRAME
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),columns=list('PR')) 
B = pd.DataFrame(rng.randint(0, 10, (3, 4)),columns=list('QPRW')) 

C = A+B
print('New dataframe:\n', C)

New dataframe:
       P   Q    R   W
0   6.0 NaN  1.0 NaN
1  19.0 NaN  5.0 NaN
2   NaN NaN  NaN NaN


<table style="text-align: center;">
    <tr><td><b>Python Operator</b></td><td><b>Pandas Method</b></td></tr>
    <tr><td>+</td><td>add()</td></tr>
    <tr><td>-</td><td>sub(), subtract()</td></tr>
    <tr><td>*</td><td>mul(), multiply()</td></tr>
    <tr><td>/</td><td>truediv(), div()</td></tr>
    <tr><td>//</td><td>floordiv()</td></tr>
    <tr><td>%</td><td>mod()</td></tr>
    <tr><td>**</td><td>pow()</td></tr>
</table>

<p>The table above lists the Python arithmetic operators and their equivalent Pandas methods. Feel free to experiment with them.</p>
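
One reason to prefer a method over its operator is that the methods accept extra arguments. The sketch below, with small illustrative frames rather than the random ones used earlier, compares `+` with `add()` and its `fill_value` parameter, which substitutes a value for entries that are missing from only one of the operands:

```python
import pandas as pd

A = pd.DataFrame([[1, 2], [3, 4]], columns=list('PR'))
B = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]],
                 columns=list('PQR'))

# The + operator leaves NaN wherever one operand has no entry
plain_sum = A + B
print(plain_sum)

# add() with fill_value=0 treats an entry missing from one operand as 0
filled_sum = A.add(B, fill_value=0)
print(filled_sum)
```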

### 4. Handling Missing Data
#### Missing Numerical Data
<p>NaN is a bit like a data virus: it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN is another NaN.</p>

In [10]:
val = np.array([1, np.nan, 3, 4])

print('Array datatype: ', val.dtype)
print('1 + np.nan = ', 1+np.nan)
print('0 * np.nan = ', 0*np.nan)
print('Sum, Max and Min of array val: ', val.sum(), val.max(), val.min())

# NumPy does provide some special aggregations that will ignore these missing values
print('Sum, Max and Min of array val: ', np.nansum(val), np.nanmax(val), np.nanmin(val))

Array datatype:  float64
1 + np.nan =  nan
0 * np.nan =  nan
Sum, Max and Min of array val:  nan nan nan
Sum, Max and Min of array val:  8.0 4.0 1.0


<p>NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, 
converting between them where appropriate.</p>

In [11]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

### 5. Operating on Null values
As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:
<table>
    <tr><th>Method</th><th>What it does</th></tr>
    <tr><td>isnull() </td><td> Generate a Boolean mask indicating missing values </td></tr>
    <tr><td>notnull() </td><td> Opposite of isnull() </td></tr>
    <tr><td>dropna() </td><td> Return a filtered version of the data </td></tr>
    <tr><td>fillna() </td><td> Return a copy of the data with missing values filled or imputed</td></tr>
</table>
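
As a quick illustration of the four methods in the table, the following sketch applies each of them to a small Series (the values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3, None])

print(s.isnull().tolist())    # True where a value is missing
print(s.notnull().tolist())   # True where a value is present
print(s.dropna().tolist())    # missing entries removed
print(s.fillna(0).tolist())   # missing entries replaced by 0
```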

### Dropping Null values

In [12]:
data = pd.Series([1, np.nan, 'hello', None])

# to check null -> data.isnull()
print('Check Null:\n', data.isnull(), '\n')
# to print non null values -> data[data.notnull()]
# to drop the null values -> data.dropna()

df = pd.DataFrame([[1, np.nan, 2], 
                    [2, 3, 5], 
                    [np.nan, 4, 6],
                    [1, 7, 2]], columns=['col1', 'col2', 'col3']) 
print('Original Dataframe:\n', df, '\n')
df_dropped = df.dropna()
print('Dataframe After Removing Rows Containing NULL Values:\n', df_dropped)

Check Null:
 0    False
1     True
2    False
3     True
dtype: bool 

Original Dataframe:
    col1  col2  col3
0   1.0   NaN     2
1   2.0   3.0     5
2   NaN   4.0     6
3   1.0   7.0     2 

Dataframe After Removing Rows Containing NULL Values:
    col1  col2  col3
1   2.0   3.0     5
3   1.0   7.0     2
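
By default, `dropna()` removes every row that contains at least one null value. The sketch below rebuilds the same frame and shows three other options that the method offers (`axis`, `how` and `thresh`); which one is appropriate depends on the data at hand:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6],
                   [1, 7, 2]], columns=['col1', 'col2', 'col3'])

# Drop columns (instead of rows) that contain a missing value
by_column = df.dropna(axis='columns')
print(by_column)

# Drop a row only if every value in it is missing (no such row here)
all_missing = df.dropna(how='all')
print(all_missing)

# Keep only rows that have at least 3 non-missing values
at_least_three = df.dropna(thresh=3)
print(at_least_three)
```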


### Filling null values 
Sometimes rather than dropping NA values, you’d rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.

In [13]:
df_filled = df.fillna(0)
print('Filled dataframe:\n', df_filled, '\n')

# FORWARD FILL ALONG EACH ROW (fillna(method=...) IS DEPRECATED IN RECENT PANDAS)
df_forward_fill = df.ffill(axis=1)
print('Forward Filled dataframe:\n', df_forward_fill, '\n')

# BACKWARD FILL DOWN EACH COLUMN
df_backward_fill = df.bfill()
print('Backward Filled dataframe:\n', df_backward_fill, '\n')


Filled dataframe:
    col1  col2  col3
0   1.0   0.0     2
1   2.0   3.0     5
2   0.0   4.0     6
3   1.0   7.0     2 

Forward Filled dataframe:
    col1  col2  col3
0   1.0   1.0   2.0
1   2.0   3.0   5.0
2   NaN   4.0   6.0
3   1.0   7.0   2.0 

Backward Filled dataframe:
    col1  col2  col3
0   1.0   3.0     2
1   2.0   3.0     5
2   1.0   4.0     6
3   1.0   7.0     2 
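
Imputation and interpolation from the good values can be sketched as follows. This is only one possible approach, assuming that the column mean and a linear interpolation are reasonable estimates for this data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6],
                   [1, 7, 2]], columns=['col1', 'col2', 'col3'])

# Imputation: replace each missing value with the mean of its column
df_mean = df.fillna(df.mean())
print(df_mean)

# Interpolation: estimate missing values linearly from their neighbours.
# A missing value at the very start of a column has no earlier neighbour
# and is left as NaN by default.
df_interp = df.interpolate()
print(df_interp)
```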



### 6. Working With Time Series

#### Dates and Times in Python: datetime and dateutil

In [14]:
from datetime import datetime
from dateutil import parser

In [15]:
# WE CAN MANUALLY BUILD A DATE USING THE datetime TYPE
print(datetime(year=2020, month=11, day=9))
# OR USING dateutil MODULE, WE CAN PARSE DATE FROM VARIETY OF STRING FORMATS
date = parser.parse("9th of November, 2020")
print(date)

2020-11-09 00:00:00
2020-11-09 00:00:00


In [16]:
# FROM DATETIME OBJECT WE CAN DO MANY THINGS, LIKE PRINTING THE DAY OF THE WEEK
date.strftime("%A")

'Monday'

In [17]:
# DATETIME USING NUMPY
date = np.array('2020-11-09', dtype=np.datetime64) 
print(date, date.dtype)

2020-11-09 datetime64[D]


Because the date is now stored as a NumPy datetime64 value, we can quickly do vectorized operations on it:

In [18]:
index = pd.DatetimeIndex(date + np.arange(12))
index

DatetimeIndex(['2020-11-09', '2020-11-10', '2020-11-11', '2020-11-12',
               '2020-11-13', '2020-11-14', '2020-11-15', '2020-11-16',
               '2020-11-17', '2020-11-18', '2020-11-19', '2020-11-20'],
              dtype='datetime64[ns]', freq=None)

In [19]:
df = pd.Series(range(1,13), index=index)
df

2020-11-09     1
2020-11-10     2
2020-11-11     3
2020-11-12     4
2020-11-13     5
2020-11-14     6
2020-11-15     7
2020-11-16     8
2020-11-17     9
2020-11-18    10
2020-11-19    11
2020-11-20    12
dtype: int64
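
A Series indexed by dates can be queried with the dates themselves. The sketch below rebuilds the same data under the illustrative name `ts` and shows a date-based slice, a single-day lookup and a vectorized attribute of the index:

```python
import numpy as np
import pandas as pd

date = np.array('2020-11-09', dtype=np.datetime64)
index = pd.DatetimeIndex(date + np.arange(12))
ts = pd.Series(range(1, 13), index=index)

# Slice by date strings; both end points are included
window = ts.loc['2020-11-10':'2020-11-12']
print(window)

# Look up the single entry for one day
one_day = ts.loc[pd.Timestamp('2020-11-15')]
print(one_day)

# Vectorized attribute access on the index, e.g. the day of the week
day_names = ts.index.day_name()[:3].tolist()
print(day_names)
```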

### pd.date_range()
To make the creation of regular date sequences more convenient, Pandas offers a few functions for this purpose: pd.date_range() for timestamps, pd.period_range() for periods, and pd.timedelta_range() for time deltas.

In [20]:
pd.date_range('2020-11-09', '2020-11-15')

DatetimeIndex(['2020-11-09', '2020-11-10', '2020-11-11', '2020-11-12',
               '2020-11-13', '2020-11-14', '2020-11-15'],
              dtype='datetime64[ns]', freq='D')

In [21]:
pd.date_range('2020-11-09', periods=8)

DatetimeIndex(['2020-11-09', '2020-11-10', '2020-11-11', '2020-11-12',
               '2020-11-13', '2020-11-14', '2020-11-15', '2020-11-16'],
              dtype='datetime64[ns]', freq='D')
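
The spacing between the generated timestamps is controlled by the `freq` argument, which defaults to daily. As a small sketch, the frequency codes `'B'` (business days) and `'W-MON'` (weekly, anchored on Mondays) produce:

```python
import pandas as pd

# Business days only (Monday to Friday); the weekend is skipped
business_days = pd.date_range('2020-11-09', periods=7, freq='B')
print(business_days)

# One timestamp per week, anchored on Mondays
mondays = pd.date_range('2020-11-09', periods=3, freq='W-MON')
print(mondays)
```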

In [22]:
pd.period_range('2020-11-08', periods=8, freq='M')

PeriodIndex(['2020-11', '2020-12', '2021-01', '2021-02', '2021-03', '2021-04',
             '2021-05', '2021-06'],
            dtype='period[M]', freq='M')

In [23]:
pd.timedelta_range(0, periods=10, freq='H')

TimedeltaIndex(['00:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00',
                '05:00:00', '06:00:00', '07:00:00', '08:00:00', '09:00:00'],
               dtype='timedelta64[ns]', freq='H')

#### With this, we conclude this session. In the next session, we will learn about the Matplotlib module.