# Table of Contents


1. **[Importing Modules (Pandas)](#pandas)**
<br><br> 
2. **[Pandas DataFrame](#dataframes)**
<br><br>
3. **[Manipulating a DataFrame](#manipulatingDF)**
<br><br>
4. **[Reading Data from Different Sources](#reading_data)**



<a id="pandas"> </a>
# 1. Pandas

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> Pandas contains data structures and data manipulation tools designed for data cleaning and analysis.
<br><br>
                       Pandas is designed for working with tabular data.<br><br>
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

A module (or library) in Python is a way to organize code. It typically contains functions and classes that you can reuse in your own programs.

### Quick look at functions

In [6]:
# what are functions and arguments
def add(x,y):
    c = x+y
    return c
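
As a small illustration of the terms above, the sketch below calls `add` with positional and keyword arguments, and then uses a function that lives inside a module. The choice of `math.sqrt` from the standard library is only an example and is not part of the original lesson.

```python
import math  # a standard-library module: a collection of ready-made functions


def add(x, y):
    """Return the sum of the two arguments."""
    c = x + y
    return c


# x and y are the parameters; 2 and 3 are the arguments passed in
total_positional = add(2, 3)
total_keyword = add(x=2, y=3)

# a function defined in a module is reached through the module name
root = math.sqrt(16)

print(total_positional, total_keyword, root)
```

Importing pandas works the same way: after `import pandas as pd`, its functions are reached through `pd.`.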

In [2]:
# Explain functions and how import works to students
# import inspect
# print(inspect.getsource(add))

**How to install and import pandas?**<br>
1. Install pandas:<br><br>
`!pip install pandas`<br><br>
2. Import pandas:<br><br>
`import pandas as pd`

In [81]:
#Check the list of base packages
!pip list

In [None]:
# install pandas
# !pip install pandas

In [10]:
#import pandas 'library/package/modules'
import pandas as pd

`as` gives the imported module an alias. From now on we will write `pd.` instead of `pandas.`
 
<br>
<span style="color:crimson">Use a library whenever a suitable one is freely available. It saves time, and library code has usually already been tested, debugged and optimized.</span>

<a id="dataframes"> </a>
# 2. Pandas DataFrames

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> A DataFrame is a tabular representation of data containing an ordered collection of columns, each of which can be a different type (numeric, string, boolean, and so on). <br><br>                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

## <span style="color:darkgreen;">To read data from a csv file</span>

In [11]:
# read the example.csv file in a dataframe
data = pd.read_csv('example.csv')
data.head(2)

Unnamed: 0,Age,Weight (in kg),Height (in m)
0,45,60,1.35
1,12,43,1.21


In [55]:
# check the type
type(data)

pandas.core.frame.DataFrame

Checking the type confirms that the file was read into a pandas DataFrame.

## <span style="color:darkgreen;">To print top & bottom rows of the data</span>

In [5]:
#top rows
data.head(2)

Unnamed: 0,Age,Weight (in kg),Height (in m)
0,45,60,1.35
1,12,43,1.21


By default, `.head()` displays the **first** five rows. To display a different number of rows, pass that number as an argument, as in `data.head(2)`.

In [7]:
#bottom rows
data.tail(2)

Unnamed: 0,Age,Weight (in kg),Height (in m)
21,56,76,1.69
22,67,78,1.85


By default, `.tail()` displays the **last** five rows. To display a different number of rows, pass that number as an argument, as in `data.tail(2)`.
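
The defaults described above can be checked on a small frame. This is a minimal sketch: `demo` is a hypothetical stand-in for `example.csv`, built in memory from the first few `Age` values shown in the outputs.

```python
import pandas as pd

# hypothetical stand-in for example.csv: seven rows, one column
demo = pd.DataFrame({"Age": [45, 12, 54, 26, 68, 21, 10]})

first_default = demo.head()   # no argument: first five rows
first_two = demo.head(2)      # first two rows
last_three = demo.tail(3)     # last three rows

print(len(first_default), len(first_two), len(last_three))
```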

## <span style="color:darkgreen;">To obtain the dimensions of the data</span>

In [60]:
data.shape

(23, 3)

## <span style="color:darkgreen;">To know the data types of a data frame</span>

In [153]:
data.dtypes

Age                 int64
Weight (in kg)      int64
Height (in m)     float64
BMI               float64
dtype: object

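The output lists the data type of each column. The `BMI` column appears because this cell was run after `BMI` was added in Section 3; the file itself has three columns.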

## <span style="color:darkgreen;">Print more information about the data</span>

In [65]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             23 non-null     int64  
 1   Weight (in kg)  23 non-null     int64  
 2   Height (in m)   23 non-null     float64
dtypes: float64(1), int64(2)
memory usage: 680.0 bytes


The output first gives the number of rows: `RangeIndex: 23 entries, 0 to 22` means there are 23 rows, labelled 0 to 22. `Data columns (total 3 columns)` means there are three columns.

The line `Age 23 non-null int64` indicates that the column 'Age' has 23 non-null observations of data type 'int64'.

Finally, the dataframe occupies 680 bytes of memory.

In [None]:
# describe your data
data.describe()
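
`.describe()` returns summary statistics for each numeric column. The sketch below shows which statistics to expect, using a hypothetical four-row frame (`demo`) rather than `example.csv`.

```python
import pandas as pd

# hypothetical stand-in data, taken from the first rows shown earlier
demo = pd.DataFrame(
    {"Age": [45, 12, 54, 26], "Height (in m)": [1.35, 1.21, 1.50, 1.21]}
)

summary = demo.describe()

# the rows of the summary are the statistics, the columns are the variables
print(summary.index.tolist())
print(summary)
```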

# `.loc` and `.iloc` indexers

## <span style="color:darkgreen;">Indexing a dataframe using `.loc`</span>

`DataFrame.loc[]` is label-based: you specify rows and columns by their row and column labels.

In [None]:
#syntax
dataframe.loc[rows, columns]

In [9]:
# select the row labelled 0 (all columns)
data.loc[0]

Age               45.000000
Weight (in kg)    60.000000
Height (in m)      1.350000
BMI               32.921811
Name: 0, dtype: float64

In [10]:
data.loc[0,'Age']

45.0

## <span style="color:darkgreen;">Selecting multiple rows</span>

In [37]:
data.loc[[4,7,10]]

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
4,68.0,50.0,1.32,28.696051
7,57.0,34.0,1.61,13.116778
10,23.0,53.0,1.5,23.555556


We use two pairs of square brackets because we are passing a list of row labels: the outer pair belongs to `.loc` and the inner pair is the list.

## <span style="color:darkgreen;">Selecting a range of rows</span>

In [48]:
data.loc[12:17]

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
12,55.0,89.0,1.65,32.690542
13,23.0,45.0,1.75,14.693878
14,56.0,76.0,1.69,26.609713
15,67.0,78.0,1.85,22.790358
16,26.0,65.0,1.21,44.395875
17,56.0,74.0,1.69,25.909457
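
Note that the output above includes row 17: a `.loc` slice contains both end labels, unlike ordinary Python slicing. A minimal sketch of the difference, using a hypothetical five-row frame (`demo`) in place of `data`:

```python
import pandas as pd

# hypothetical stand-in data with the default labels 0..4
demo = pd.DataFrame(
    {"Age": [45, 12, 54, 26, 68], "Weight (in kg)": [60, 43, 78, 65, 50]}
)

by_label = demo.loc[1:3]       # labels 1, 2 and 3: the end label is included
by_position = demo.iloc[1:3]   # positions 1 and 2: the end position is excluded
one_cell = demo.loc[0, "Age"]  # one row label and one column label

print(by_label.index.tolist(), by_position.index.tolist(), one_cell)
```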


## <span style="color:darkgreen;">Selecting the first column</span>

In [46]:
# syntax: data.loc[rows, columns]; a bare ':' selects all rows
data.loc[:,'Age']

0     45.0
1     12.0
2     54.0
3     26.0
4     68.0
5     21.0
6     10.0
7     57.0
8     75.0
9     32.0
10    23.0
11    34.0
12    55.0
13    23.0
14    56.0
15    67.0
16    26.0
17    56.0
18    67.0
19    26.0
20    68.0
21    56.0
22    67.0
23    56.0
Name: Age, dtype: float64

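Negative positions such as -1 (last column) and -2 (second-to-last column) work only with the position-based `.iloc`, for example `data.iloc[:, -1]`. They do not work with `.loc`, which expects labels.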

## <span style="color:darkgreen;">Selecting multiple columns</span>

In [51]:
data.loc[:,['Age','BMI']]

Unnamed: 0,Age,BMI
0,45.0,32.921811
1,12.0,29.369579
2,54.0,34.666667
3,26.0,44.395875
4,68.0,28.696051
5,21.0,18.611496
6,10.0,11.753903
7,57.0,13.116778
8,75.0,14.958377
9,32.0,9.089335


## <span style="color:darkgreen;">Indexing a dataframe using `.iloc`</span>

`DataFrame.iloc[]` is integer position-based, so you have to specify rows and columns by their integer position values (0-based integer position).

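**Note:** in this data the row labels are themselves the integers 0, 1, 2, ..., so row selections with `.loc` and `.iloc` look alike. They differ in that `.loc` refers to the label and `.iloc` to the position.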

In [53]:
# using iloc select 'Age' & 'Weight (in kg)'
data.iloc[:,0:2]

Unnamed: 0,Age,Weight (in kg)
0,45.0,60.0
1,12.0,43.0
2,54.0,78.0
3,26.0,65.0
4,68.0,50.0
5,21.0,43.0
6,10.0,32.0
7,57.0,34.0
8,75.0,23.0
9,32.0,21.0
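
A short sketch of position-based selection, including negative positions that count from the end. The frame `demo` is a hypothetical three-row stand-in for `data`.

```python
import pandas as pd

# hypothetical stand-in data
demo = pd.DataFrame(
    {
        "Age": [45, 12, 54],
        "Weight (in kg)": [60, 43, 78],
        "Height (in m)": [1.35, 1.21, 1.50],
    }
)

first_two_cols = demo.iloc[:, 0:2]   # column positions 0 and 1
last_col = demo.iloc[:, -1]          # -1 is the last column
second_last_col = demo.iloc[:, -2]   # -2 is the second-to-last column
cell = demo.iloc[0, 0]               # first row, first column

print(list(first_two_cols.columns), last_col.name, second_last_col.name, cell)
```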


<a id="manipulatingDF"> </a>
# 3. Manipulating a Dataframe

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> CAUTION:<br>
                        1. DataFrame[column] works for any column name, but DataFrame.column only works when the column name is a valid Python variable name.<br>
                        2. New columns cannot be created with the ` data.BMI ` syntax.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>
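
The second caution can be seen in a small experiment. In this sketch (with a hypothetical two-row frame, `demo`), attribute assignment sets an ordinary Python attribute and leaves the columns unchanged, while bracket assignment creates the column. pandas issues a `UserWarning` for the attribute form, which is silenced here only to keep the output clean.

```python
import warnings

import pandas as pd

# hypothetical stand-in data
demo = pd.DataFrame({"Weight (in kg)": [60, 43], "Height (in m)": [1.35, 1.21]})
bmi = demo["Weight (in kg)"] / demo["Height (in m)"] ** 2

# attribute syntax: no column is created
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    demo.BMI = bmi
created_by_attribute = "BMI" in demo.columns

# bracket syntax: the column is created
demo["BMI"] = bmi
created_by_bracket = "BMI" in demo.columns

print(created_by_attribute, created_by_bracket)
```

Column names such as `Weight (in kg)` contain spaces and parentheses, so they can only be reached with brackets.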

## <span style="color:darkgreen;">Adding a new column to the dataframe</span>

The terms column, variable, and feature are used interchangeably.

In [4]:
# create a new column BMI, given by weight / height**2
data['BMI'] = data['Weight (in kg)']/ data['Height (in m)']**2
data.head()

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45,60,1.35,32.921811
1,12,43,1.21,29.369579
2,54,78,1.5,34.666667
3,26,65,1.21,44.395875
4,68,50,1.32,28.696051


In [5]:
# check the shape of the data
data.shape

(23, 4)

## <span style="color:darkgreen;">Adding a new row to the dataframe</span>

In [7]:
data.loc[23] = [56, 76, 1.69, 26.609713]
data.head(2)

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45.0,60.0,1.35,32.921811
1,12.0,43.0,1.21,29.369579


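A new row with label 23 has been added to the data (use `data.tail()` to see it). The output above also shows that the integer columns were converted to float when the row was added.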
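
Assigning to a new label with `.loc` is convenient for a single row. A minimal alternative sketch, assuming the new row is available as a dictionary, uses `pd.concat`; the frame `demo` and the values are illustrative stand-ins. When the new values are integers too, the integer column stays an integer column.

```python
import pandas as pd

# hypothetical stand-in data
demo = pd.DataFrame({"Age": [45, 12], "Height (in m)": [1.35, 1.21]})

# the new row as a one-row DataFrame with matching column names
new_row = pd.DataFrame([{"Age": 56, "Height (in m)": 1.69}])

# ignore_index=True renumbers the rows 0, 1, 2
extended = pd.concat([demo, new_row], ignore_index=True)

print(extended)
print(extended.dtypes)
```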

## <span style="color:darkgreen;">Sorting the dataframe</span>

In [None]:
# IPython/Jupyter only: show the documentation for sort_values
pd.DataFrame.sort_values?

In [1]:
# sort the data frame by 'Age'; by default the values are sorted in ascending order
# Note: 'ascending=False' sorts the data frame in descending order.
# sort_values returns a new, sorted DataFrame and leaves 'data' unchanged
data.sort_values('Age', ascending=True)

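
A minimal sketch of sorting in both directions, using a hypothetical three-row frame (`demo`) in place of `data`. `sort_values` returns a new DataFrame, so the result must be assigned to a name if it is to be kept.

```python
import pandas as pd

# hypothetical stand-in data
demo = pd.DataFrame({"Age": [45, 12, 54], "Weight (in kg)": [60, 43, 78]})

ascending = demo.sort_values("Age")                     # default: ascending=True
descending = demo.sort_values("Age", ascending=False)   # largest value first

# the original frame keeps its order; rows keep their original labels
print(ascending.index.tolist(), descending["Age"].tolist(), demo["Age"].tolist())
```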

## <span style="color:darkgreen;">Dropping Rows and Columns</span>

In [76]:
# To drop a column
data1 = data.drop('BMI', axis=1)

In [108]:
# drop the row labelled 23; this returns a new DataFrame and leaves 'data' unchanged
data.drop(23)

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45.0,60.0,1.35,32.921811
1,12.0,43.0,1.21,29.369579
2,54.0,78.0,1.5,34.666667
3,26.0,65.0,1.21,44.395875
4,68.0,50.0,1.32,28.696051
5,21.0,43.0,1.52,18.611496
6,10.0,32.0,1.65,11.753903
7,57.0,34.0,1.61,13.116778
8,75.0,23.0,1.24,14.958377
9,32.0,21.0,1.52,9.089335
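
The two calls above differ only in the axis they act on. A short sketch on a hypothetical three-row frame (`demo`), which also shows that `drop` leaves the original frame untouched:

```python
import pandas as pd

# hypothetical stand-in data
demo = pd.DataFrame({"Age": [45, 12, 54], "BMI": [32.9, 29.4, 34.7]})

without_bmi = demo.drop("BMI", axis=1)      # axis=1: drop a column by label
without_row = demo.drop(2)                  # default axis=0: drop a row by label
keyword_form = demo.drop(columns="BMI")     # same as axis=1, spelled out

print(list(without_bmi.columns), without_row.index.tolist(), demo.shape)
```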


## <span style="color:darkgreen;">Dropping duplicates</span>

In [8]:
# Check if data has duplicates
duplicate_count = data.duplicated().sum()

In [107]:
# to drop duplicates from your data
data.drop_duplicates(inplace=True)
data

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45.0,60.0,1.35,32.921811
1,12.0,43.0,1.21,29.369579
2,54.0,78.0,1.5,34.666667
3,26.0,65.0,1.21,44.395875
4,68.0,50.0,1.32,28.696051
5,21.0,43.0,1.52,18.611496
6,10.0,32.0,1.65,11.753903
7,57.0,34.0,1.61,13.116778
8,75.0,23.0,1.24,14.958377
9,32.0,21.0,1.52,9.089335
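
To see both steps on data that actually contains a duplicate, here is a minimal sketch with a hypothetical frame (`demo`) whose third row repeats the first:

```python
import pandas as pd

# hypothetical stand-in data: row 2 is identical to row 0
demo = pd.DataFrame({"Age": [45, 12, 45], "Weight (in kg)": [60, 43, 60]})

# duplicated() marks each row that repeats an earlier row; sum() counts them
duplicate_count = demo.duplicated().sum()

# without inplace=True, drop_duplicates returns a new DataFrame
deduplicated = demo.drop_duplicates()

print(duplicate_count, deduplicated.index.tolist())
```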


## <span style="color:darkgreen;">Checking for missing values</span>

Let's import a new dataset.

In [145]:
# Import missingdata.csv 
mdata = pd.read_csv('missingdata.csv')
mdata.head(2)

Unnamed: 0,Age,Weight (in kg),Height (in m)
0,45.0,60.0,1.35
1,12.0,43.0,1.21


In [147]:
# check for nulls
mdata.info()
#mdata.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             22 non-null     float64
 1   Weight (in kg)  21 non-null     float64
 2   Height (in m)   22 non-null     float64
dtypes: float64(3)
memory usage: 680.0 bytes


The method `.isnull()` checks whether each value is missing. Chaining `.sum()` counts the 'True' values in each column, so `mdata.isnull().sum()` gives the number of missing values per column. The same information can be read from the `Non-Null Count` column of `mdata.info()`.

Here, 'Weight (in kg)' has 2 missing values (21 of 23 entries are non-null), and 'Age' and 'Height (in m)' have one missing value each.
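
The counting step can be sketched on a small frame with known gaps. `demo` is a hypothetical stand-in for `missingdata.csv`, with missing values written as `float("nan")`.

```python
import pandas as pd

nan = float("nan")  # pandas treats NaN as a missing value

# hypothetical stand-in data: one gap in 'Age', two in 'Weight (in kg)'
demo = pd.DataFrame({"Age": [45, nan, 54], "Weight (in kg)": [nan, 43, nan]})

# isnull() gives True/False per cell; sum() adds up the True values per column
missing_per_column = demo.isnull().sum()

print(missing_per_column)
```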

## <span style="color:crimson;">Take home exercise</span>

<a id="reading_data"> </a>
### Reading Data from Different Sources

Note that the file names are examples only. You can import your own files to try the examples below.

**1. Read a `.xlsx` file**

`pd.read_excel('example.xlsx')`

**2. Read a `.txt` file**

`data = pd.read_csv('example.txt', sep="\t")`

**3. Read a `.zip` file**

    import zipfile
    with zipfile.ZipFile('data.zip') as z:
        with z.open('example.csv') as f:
            file = pd.read_csv(f)
            print(file.head())

**4. Read a `.html` file**

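`tables = pd.read_html('example.html', header=1, index_col=0)`

`pd.read_html` returns a list of DataFrames, one for each table found in the page; `tables[0]` is the first of them.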

**5. Read a `.json` file**

`pd.read_json('example.json')`
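
If you have no files at hand, two of the readers above can be tried on data held in memory. This is only a sketch: the tab-separated text and the zip archive are built inside the script, and the name `example.csv` inside the archive is an assumption carried over from the example above.

```python
import io
import zipfile

import pandas as pd

# tab-separated text, held in memory instead of a .txt file
tab_text = "Age\tWeight (in kg)\n45\t60\n12\t43\n"
from_txt = pd.read_csv(io.StringIO(tab_text), sep="\t")

# build a small zip archive in memory, then read the CSV stored inside it
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("example.csv", "Age,Weight (in kg)\n45,60\n12,43\n")
buffer.seek(0)
with zipfile.ZipFile(buffer) as z:
    with z.open("example.csv") as f:
        from_zip = pd.read_csv(f)

print(from_txt)
print(from_zip)
```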