<big><i>
All the Notebooks in this lecture series by **[Abdul Aziz MD](https://www.linkedin.com/in/abdul-aziz-md/)**
</i></big>

<center><h1>Pandas for EDA</h1></center>


* In the previous ``notebook``, we dove into detail on NumPy and its ``ndarray`` object, which provides efficient storage and manipulation of dense typed arrays in Python.
* In this ``notebook``, we'll build on that knowledge by looking in detail at the data structures provided by the Pandas library.
* Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a ``DataFrame``.
* ``DataFrame``s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

In [1]:
import pandas as pd
print(pd.__version__)

2.2.2


Let's introduce the three fundamental Pandas data structures: the ``Series``, ``DataFrame``, and ``Index``.

---
We will start our code sessions with the standard NumPy and Pandas imports:

In [2]:
import numpy as np
import pandas as pd

## The Pandas Series Object

A Pandas ``Series`` is a one-dimensional array of indexed data.


### Constructing Series objects

```
>>> pd.Series(data, index=index)
```

where ``index`` is an optional argument, and ``data`` can be one of many entities.

For example, ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
* The ``values`` are simply a NumPy array:

In [4]:
data1 = pd.Series([25, 5, 0.75, 100])
data1

0     25.00
1      5.00
2      0.75
3    100.00
dtype: float64

In [6]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

The ``index`` is an array-like object of type ``pd.Index``.

In [7]:
data.index

RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, data can be accessed by the associated index via the Python square-bracket notation:

In [8]:
data[1]

0.5

In [9]:
data[:3]

0    0.25
1    0.50
2    0.75
dtype: float64

This explicit index definition gives the ``Series`` object additional capabilities. The index need not be an integer; it can consist of values of any desired type.
For example, if we wish, we can use strings as an index:

In [10]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
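The ``data`` argument can also be a dictionary or a single scalar. The following is a minimal sketch with made-up values; the variable names ``from_dict`` and ``from_scalar`` are only illustrative:

```python
import pandas as pd

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({'a': 0.25, 'b': 0.5, 'c': 0.75})

# From a scalar: the value is repeated for every label in the given index.
from_scalar = pd.Series(5, index=[100, 200, 300])

print(from_dict)
print(from_scalar)
```

In both cases the result is an ordinary ``Series``, so everything shown for list-based ``Series`` applies here as well.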

In [11]:
data['b']  # access a value by its index label

0.5

In [12]:
data[:'b']

a    0.25
b    0.50
dtype: float64
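Note that slicing by label includes the end label, whereas slicing by integer position excludes the end position. A small sketch of the difference, using the explicit ``.loc`` (label-based) and ``.iloc`` (position-based) accessors on a ``Series`` like the one above:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: the end label 'c' is included.
by_label = data.loc['a':'c']

# Position-based slice: the end position 2 is excluded.
by_position = data.iloc[0:2]

print(by_label)
print(by_position)
```

Using ``.loc`` and ``.iloc`` makes the intent explicit, which helps when the index itself consists of integers.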

##  Pandas `DataFrame`
A ``DataFrame`` is a two-dimensional array with both flexible row indices and flexible column names.

The ``DataFrame()`` constructor is used to create a dataframe in Pandas. The syntax for creating a dataframe is:

``pandas.DataFrame(data, index, columns)``
where,

* ``data:`` The data from which the dataframe is created. It can be a list, dictionary, scalar value, ``Series``, ``ndarray``, etc.

* ``index:`` Optional row labels. By default, the rows are labeled from 0 to n-1, where n is the number of rows.

* ``columns:`` Optional column names. If no column names are given and none can be inferred from ``data``, the columns are labeled from 0 to n-1, where n is the number of columns.
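As a rough illustration of how the kind of ``data`` affects the resulting labels, here is a minimal sketch with made-up values (the names ``from_array``, ``from_dict`` and ``from_series`` are only illustrative):

```python
import numpy as np
import pandas as pd

# NumPy array with no index or columns given: both default to 0..n-1.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2))

# Dictionary of lists: the keys become the column names.
from_dict = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]},
                         index=['r1', 'r2', 'r3'])

# Dictionary of Series: rows are aligned on the index labels,
# and labels missing from one Series are filled with NaN.
from_series = pd.DataFrame({'x': pd.Series([1, 2], index=['r1', 'r2']),
                            'y': pd.Series([3, 4], index=['r2', 'r3'])})

print(from_array)
print(from_dict)
print(from_series)
```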

In [14]:
matrix = np.random.randint(1,20,size=20).reshape(5,4)

In [15]:
df = pd.DataFrame(matrix)
df

    0   1   2   3
0  14  18   2   6
1   1  14  11   8
2   8  11   9   1
3  10  17   2   8
4  13  18  15  11


In [16]:
matrix_data = np.random.randint(1,20,size=20).reshape(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nThe data frame looks like\n",'-'*45, sep='')
df


The data frame looks like
---------------------------------------------


    W   X   Y   Z
A   8   2   8   4
B  16   2  14  14
C   3  15  12  17
D  18   8  14  14
E  12  19   9   6


In [17]:
df.index

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [18]:
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')

In [23]:
df.shape

(5, 4)

In [26]:
df[["W"]].shape

(5, 1)

In [24]:
df['W'].shape

(5,)
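The difference in shape above comes from the type that each selection returns: single brackets with one label give a ``Series``, while a list of labels gives a ``DataFrame``. A small sketch, assuming a stand-in frame built with ``np.arange`` rather than the random data used above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['W', 'X', 'Y', 'Z'])

as_series = df['W']    # one label       -> Series, one-dimensional
as_frame = df[['W']]   # list of labels  -> DataFrame, two-dimensional

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```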

In [21]:
df1 = df[["W","Y"]]
df1

    W   Y
A   8   8
B  16  14
C   3  12
D  18  14
E  12   9


In [27]:
df1=df[['W','X']]
df1

    W   X
A   8   2
B  16   2
C   3  15
D  18   8
E  12  19


In [28]:
df.loc["A"]

W    8
X    2
Y    8
Z    4
Name: A, dtype: int32

In [29]:
df.loc[['A','B']]  # .loc selects rows by their index labels

    W  X   Y   Z
A   8  2   8   4
B  16  2  14  14
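``.loc`` works with labels; its counterpart ``.iloc`` works with integer positions. Both also accept a row selector and a column selector together. A minimal sketch, again assuming a stand-in frame built with ``np.arange`` instead of the random data above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['W', 'X', 'Y', 'Z'])

first_row = df.iloc[0]                      # first row, by position
cell = df.loc['A', 'W']                     # single value, by row and column label
block = df.loc[['A', 'B'], ['W', 'Y']]      # sub-frame, by lists of labels

print(first_row)
print(cell)
print(block)
```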


In [31]:
# Creating a DataFrame from a dict of lists:

# Initialise the data as a dict of lists.
data = {'Name':['aziz', 'abdul', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}

# Create the DataFrame with explicit row labels.
df = pd.DataFrame(data, index=[1, 2, 3, 4])

# Display the result.
df

    Name  Age
1   aziz   20
2  abdul   21
3  krish   19
4   jack   18


``Column Selection:`` To select a column in a Pandas ``DataFrame``, we access it by its column name. Passing a list of column names selects several columns at once.


In [32]:

# Define a dictionary containing employee data
data = {'Name':['Abdul', 'Aziz', 'Gaurav', 'Anju'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Andhra', 'Tamilnadu', 'Kerala'],
        'Qualification':['MTech', 'BTech', 'MCA', 'Phd']}
 
# Convert the dictionary into a DataFrame with explicit row labels
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])

# Print the full DataFrame
print(df)
print('*'*40)
# Select two columns
print(df[['Name', 'Qualification']])

     Name  Age    Address Qualification
A   Abdul   27      Delhi         MTech
B    Aziz   24     Andhra         BTech
C  Gaurav   22  Tamilnadu           MCA
D    Anju   32     Kerala           Phd
****************************************
     Name Qualification
A   Abdul         MTech
B    Aziz         BTech
C  Gaurav           MCA
D    Anju           Phd


In [33]:
# retrieving row by loc method

first = df.loc["A"]

print(first)

Name             Abdul
Age                 27
Address          Delhi
Qualification    MTech
Name: A, dtype: object


## Read CSV Files
A simple way to store large data sets is to use CSV (comma-separated values) files.

In [39]:
data_df = pd.read_csv("C:\\Users\\HP\\Desktop\\VSM AI CSE A_B\\data.csv")
data_df

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
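``pd.read_csv`` accepts any file-like object as well as a path, which makes it easy to experiment without a file on disk. The sketch below wraps a few rows in ``io.StringIO``; the text is copied from the first three rows shown above, and the name ``sample_df`` is only illustrative:

```python
import io
import pandas as pd

# A small CSV document held in memory instead of on disk.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
"""

# The first line is used as the column header by default.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)
print(sample_df.dtypes)
```

The column types are inferred from the text: whole numbers become integers and values with a decimal point become floats.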


### Quick checking DataFrames
* `.head()`
* `.tail()`
* `.sample()`
* `.info()`
* `.describe()`

In [41]:
data_df.head(10)  # returns the first 10 rows (the default is 5)

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.0
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0


In [None]:
data_df.head(3)  # returns the first 3 rows

In [42]:
data_df.tail()  # returns the last 5 rows

     Duration  Pulse  Maxpulse  Calories
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4


In [43]:
data_df.tail(7)  # returns the last 7 rows

     Duration  Pulse  Maxpulse  Calories
162        45     95       130     270.0
163        45    100       140     280.9
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4


In [49]:
data_df.sample(10)  # returns 10 randomly chosen rows

     Duration  Pulse  Maxpulse  Calories
38         60    100       120     300.0
92         30     90       107     105.3
58         20    153       172     226.4
73        150     97       127     953.2
65        180     90       130     800.4
96         30     95       128     128.2
161        45     90       130     260.4
147        60    112       146     361.9
33         60     93       113     223.0
21         45    100       119     282.0
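Because ``.sample()`` is random, each call normally returns different rows. When a reproducible result is needed, the ``random_state`` parameter can be fixed, and ``frac`` can be used to request a fraction of the rows instead of a count. A minimal sketch on a made-up frame:

```python
import pandas as pd

toy_df = pd.DataFrame({'value': range(100)})

# The same random_state gives the same rows on every call.
first_draw = toy_df.sample(5, random_state=0)
second_draw = toy_df.sample(5, random_state=0)

# frac=0.1 asks for 10% of the rows rather than a fixed number.
tenth = toy_df.sample(frac=0.1, random_state=0)

print(first_draw.equals(second_draw))
print(len(tenth))
```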


The ``data_df.info()`` method prints a summary of the ``DataFrame``: the index range, the number of columns, the column labels, the number of non-null values in each column, the column data types, and the memory usage.

In [46]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
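In the output above, ``Calories`` has 164 non-null values out of 169 entries, so 5 values are missing. The same counts can be obtained directly with ``isna().sum()`` and ``count()``. A small sketch on a made-up frame (``toy_df`` and its values are illustrative, not the data set loaded above):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'Duration': [60, 60, 45, 45],
                       'Calories': [409.1, np.nan, 340.0, np.nan]})

# Number of missing values in each column.
missing_per_column = toy_df.isna().sum()

# Number of non-null values in each column, as reported by info().
non_null_per_column = toy_df.count()

print(missing_per_column)
print(non_null_per_column)
```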


---
The ``data_df.describe()`` method returns a statistical description of the data in the ``DataFrame``.

> If the `DataFrame` contains numerical data, the description contains the following information for each column:

* count - The number of non-empty values.
* mean - The average (mean) value.
* std - The standard deviation.
* min - The minimum value.
* 25% - The 25th percentile.
* 50% - The 50th percentile (the median).
* 75% - The 75th percentile.
* max - The maximum value.

In [50]:
data_df.describe()

        Duration       Pulse    Maxpulse    Calories
count      169.0       169.0       169.0       164.0
mean   63.846154  107.461538  134.047337  375.790244
std    42.299949   14.510259   16.450434  266.379919
min         15.0        80.0       100.0        50.3
25%         45.0       100.0       124.0     250.925
50%         60.0       105.0       131.0       318.6
75%         60.0       111.0       141.0       387.6
max        300.0       159.0       184.0      1860.4


In [51]:
data_df.describe().transpose()

          count        mean         std    min      25%    50%    75%     max
Duration  169.0   63.846154   42.299949   15.0     45.0   60.0   60.0   300.0
Pulse     169.0  107.461538   14.510259   80.0    100.0  105.0  111.0   159.0
Maxpulse  169.0  134.047337   16.450434  100.0    124.0  131.0  141.0   184.0
Calories  164.0  375.790244  266.379919   50.3  250.925  318.6  387.6  1860.4
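Each entry of ``describe()`` corresponds to an ordinary method that can also be called on its own. The sketch below checks this on a tiny made-up ``Series`` and shows the ``percentiles`` parameter, which replaces the default 25% and 75% rows (the median is always kept):

```python
import pandas as pd

toy = pd.Series([1, 2, 3, 4, 5], name='toy')

summary = toy.describe()

# The same numbers, computed one at a time.
print(summary['mean'] == toy.mean())
print(summary['std'] == toy.std())
print(summary['25%'] == toy.quantile(0.25))

# Custom percentiles instead of the default quartiles.
custom = toy.describe(percentiles=[0.1, 0.9])
print(custom)
```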


> ``This is a short notebook to get familiar with Pandas. In the next notebook, we will work on a real data set to do data analysis.``

## References
* https://github.com/donnemartin/data-science-ipython-notebooks/tree/master/pandas
* https://github.com/LearnDataSci/articles/tree/master/Python%20Pandas%20Tutorial%20A%20Complete%20Introduction%20for%20Beginners

<center><h1>Happy Learning.</h1></center>