<div style="background-color: lightgray; padding: 18px;">
    <h1>Learning Python | Day 17</h1>
    
</div>

### Features:
Pandas
- Intro to Pandas DataFrames
- DataFrames
- Accessing elements
- Indexing and Selecting data
- Read/Write Files
- DataFrame Properties and Methods

<div style="background-color: lightgreen; padding: 10px;">
    <h2>Introduction</h2>
</div>

``Pandas`` is an open-source library built on top of the ``NumPy`` library.

It is a Python package that offers various *data structures* and *operations* for manipulating numerical data and time series. It is popular mainly because it makes importing and analyzing data much easier. Pandas is **fast** and is designed for high performance and user productivity.

Here is a list of things that we can do using Pandas:
- Data set cleaning, merging, and joining.
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
- Columns can be inserted and deleted from DataFrame and higher dimensional objects.
- Powerful group-by functionality for performing split-apply-combine operations on data sets.
- Data visualization
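
The split-apply-combine idea from the list above can be sketched as follows. This is a minimal illustration, assuming a small invented `sales` table (the store names and amounts are not from any real dataset):

```python
import pandas as pd

# Hypothetical sales records, invented for illustration
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "amount": [10, 20, 5, 15, 10],
})

# Split the rows by store, apply sum() to each group,
# and combine the results into one Series indexed by store
totals = sales.groupby("store")["amount"].sum()
print(totals)
```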

Sources:
- https://pandas.pydata.org/
- https://www.w3schools.com/python/pandas/default.asp
- https://www.geeksforgeeks.org/pandas-tutorial/

<div style="background-color: lightgreen; padding: 10px;">
    <h2>Pandas DataFrames</h2>
</div>

Pandas provides two main data structures for manipulating data:
- Series
- DataFrame

*Pandas DataFrames*

A Pandas ``DataFrame`` is a two-dimensional data structure, like a two-dimensional array or a table, with three principal components: the ``data``, the ``rows``, and the ``columns``.
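
As a rough sketch of how the two structures relate, the snippet below builds a small DataFrame and inspects its three components. It also shows that selecting a single column gives back a ``Series``. The variable names are chosen only for this illustration:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# The three components of a DataFrame
print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels
print(df.to_numpy())        # the underlying data as a NumPy array

# Selecting one column returns a one-dimensional Series
column = df["calories"]
print(type(column).__name__)
```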

Sources:
- https://www.w3schools.com/python/pandas/pandas_dataframes.asp
- https://www.geeksforgeeks.org/python-pandas-dataframe/

In [1]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Load the data into a DataFrame object:
df = pd.DataFrame(data)

print(df) 

   calories  duration
0       420        50
1       380        40
2       390        45


---
The following basic operations on a Pandas DataFrame are covered briefly below:
- Creating a DataFrame
- Dealing with Rows and Columns
- Indexing and Selecting Data
- Working with Missing Data

In [3]:
# Initialise the data as a dictionary of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18


In [6]:
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
 
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data)

---
We can perform basic operations on rows/columns like ``selecting``, ``deleting``, ``adding``, and ``renaming``. 

**Column Selection:** To select a column in a Pandas DataFrame, we access it by its ``column name``. Passing a list of names selects several columns at once.

In [5]:
# select two columns
print(df[['Name', 'Qualification']])

     Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd
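
The other column operations mentioned above (adding, renaming, and deleting) could look like the sketch below. The ``Senior`` column and the ``Years`` name are hypothetical and exist only for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jai", "Princi", "Gaurav", "Anuj"],
    "Age": [27, 24, 22, 32],
})

# Adding: assigning to a new name creates the column
# ("Senior" is a made-up column for this example)
df["Senior"] = df["Age"] > 25

# Renaming: rename() returns a new DataFrame by default
df = df.rename(columns={"Age": "Years"})

# Deleting: drop() also returns a new DataFrame by default
df = df.drop(columns=["Senior"])

print(df.columns.tolist())
```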


**Row Selection:** Pandas provides dedicated indexers for retrieving rows from a DataFrame. ``DataFrame.loc[]`` selects rows by label, and ``DataFrame.iloc[]`` selects rows by integer position.

In [8]:
dados = [
  [1.69, 87.0],
  [1.59, 56.5],
  [1.69, 90.3],
  [1.74, 78.6]
]
df = pd.DataFrame(dados, columns=['altura', 'peso'])
df

Unnamed: 0,altura,peso
0,1.69,87.0
1,1.59,56.5
2,1.69,90.3
3,1.74,78.6


In [9]:
df['altura'][1] # plain Python-style chained indexing: [column][row], using labels

1.59

In [10]:
df.loc[1, 'altura'] # with loc, we write [row, column], using labels

1.59

In [11]:
df.iloc[1, 0] # with iloc, we write [row, column], using integer positions

1.59
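
The cells above pick out single values. To retrieve whole rows, a sketch along the following lines may help. It assumes hypothetical string labels (``"a"`` to ``"d"``) as the index, so that the difference between labels and positions is visible:

```python
import pandas as pd

df = pd.DataFrame(
    [[1.69, 87.0], [1.59, 56.5], [1.69, 90.3], [1.74, 78.6]],
    columns=["altura", "peso"],
    index=["a", "b", "c", "d"],  # hypothetical labels for this example
)

# A single row comes back as a Series
row_by_label = df.loc["b"]
row_by_position = df.iloc[1]

# Label slices include the end label; position slices exclude the end
label_slice = df.loc["b":"c"]
position_slice = df.iloc[1:3]

print(label_slice)
```

Both slices select the same two rows here, even though the end points are written differently.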

---
**Working with Missing Data**

``Missing Data`` occurs when no information is provided for one or more items or for a whole unit, and it is a common problem in real-world datasets. In pandas, missing values are also referred to as ``NA`` (Not Available) values.

- Checking for missing values using ``isnull()`` and ``notnull()``: to check for missing values in a Pandas DataFrame, we use the methods ``isnull()`` and ``notnull()``.

Both methods check whether each value is NaN or not. They can also be used on a Pandas Series to find null values in the series.

In [12]:
import pandas as pd
import numpy as np
 
# dictionary of lists (named `scores` to avoid shadowing the built-in `dict`)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}
 
# creating a dataframe from the dictionary
df = pd.DataFrame(scores)
 
# using isnull() function  
df.isnull()

Unnamed: 0,First Score,Second Score,Third Score
0,False,False,True
1,False,False,False
2,True,False,False
3,False,True,False
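
``notnull()`` is the mirror image of ``isnull()``. A minimal sketch on a single Series, reusing the first column of scores from the cell above, might be:

```python
import numpy as np
import pandas as pd

scores = pd.Series([100, 90, np.nan, 95], name="First Score")

# Keep only the entries that are present
present = scores[scores.notnull()]

# Count the entries that are missing
missing_count = scores.isnull().sum()

print(present)
print(missing_count)
```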


In [13]:
# filling missing value using fillna()  
df.fillna(0)

Unnamed: 0,First Score,Second Score,Third Score
0,100.0,30.0,0.0
1,90.0,45.0,40.0
2,0.0,56.0,80.0
3,95.0,0.0,98.0


In [14]:
# dictionary of lists (named `scores` to avoid shadowing the built-in `dict`)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, 40, 80, 98],
          'Fourth Score':[np.nan, np.nan, np.nan, 65]}
 
# creating a dataframe from dictionary
df = pd.DataFrame(scores)
 
# using dropna() function  
df.dropna()

Unnamed: 0,First Score,Second Score,Third Score,Fourth Score
3,95.0,56.0,98,65.0
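
By default ``dropna()`` removes every row that contains at least one missing value, which left a single row above. The sketch below shows a few less drastic options on the same data. The variable names are chosen for this illustration only:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
    "Third Score": [52, 40, 80, 98],
    "Fourth Score": [np.nan, np.nan, np.nan, 65],
})

# Drop columns (instead of rows) that contain missing values
complete_columns = scores.dropna(axis=1)

# Drop rows only when a specific column is missing
rows_with_first = scores.dropna(subset=["First Score"])

# Keep rows that have at least three non-missing values
rows_min_three = scores.dropna(thresh=3)

# Fill each missing value with the mean of its column
filled = scores.fillna(scores.mean())

print(rows_min_three)
```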


<div style="background-color: lightgreen; padding: 10px;">
    <h2>Read and Write files:</h2>
</div>

The `DataFrame` is a representation of a table. It has two labeled axes: the rows (labeled by the `index`) and the columns (labeled by an index object holding the column names).

There are several ways to create a dataframe, including from lists, dictionaries, etc.

One of the most common methods is to create it by reading a file in the `.csv` format, as we'll see next for the case of the `titanic` dataset, well-known to those working with data science. The dataset can be downloaded [here](https://www.kaggle.com/competitions/titanic/data).

In [3]:
import pandas as pd
df = pd.read_csv('train.csv')

In [4]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


**Note:**
We can also read a file in the ``.xlsx`` format, which is native to ``Excel``.
For this, we use the ``pd.read_excel`` function.

It's also possible to read many other less common formats, such as ``.data``, ``.txt``, and others. The most general function for reading these delimited text formats is ``pd.read_table``.
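
As a small sketch of ``pd.read_table``, the example below reads tab-separated text from an in-memory buffer. The buffer stands in for a ``.txt`` file on disk, and its content is invented for illustration:

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for a .txt file
raw = "name\tqty\nTablet\t8\nNotebook\t20\n"

# read_table uses a tab as its default separator
table = pd.read_table(io.StringIO(raw))

# read_csv reads the same data when given the separator explicitly
same_table = pd.read_csv(io.StringIO(raw), sep="\t")

print(table)
```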

In [9]:
products = pd.Series(['Celular', 'Tablet', 'Notebook', 'Fone de Ouvido', 'Smartwatch'])
qty = pd.Series([15, 8, 20, 5, 12])

dic = {
    'name': ['Celular', 'Tablet', 'Notebook', 'Fone de Ouvido', 'Smartwatch'],
    'qty': [15, 8, 20, 5, 12]
}

df1 = pd.DataFrame(dic, index=[214314, 431431, 532453, 542234, 431413])
df1

Unnamed: 0,name,qty
214314,Celular,15
431431,Tablet,8
532453,Notebook,20
542234,Fone de Ouvido,5
431413,Smartwatch,12


In [10]:
df1.to_csv("products.csv", index=False) # write the products table; index=False leaves out the index labels
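
When the index carries meaningful labels, such as the product codes above, it may be preferable to write it out and restore it on reading. The following sketch assumes a shortened version of the products table and uses an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

df1 = pd.DataFrame(
    {"name": ["Tablet", "Notebook"], "qty": [8, 20]},
    index=[431431, 532453],
)

# Write with the index (the default); it becomes the first CSV column
buffer = io.StringIO()
df1.to_csv(buffer)

# Read it back, telling pandas that the first column is the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

# With index=False the labels are left out of the output
header_without_index = df1.to_csv(index=False).splitlines()[0]

print(restored)
```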

<div style="background-color: lightgreen; padding: 10px;">
    <h2>Properties and Methods:</h2>
</div>


| PROPERTY / METHOD | DESCRIPTION | PROPERTY / METHOD | DESCRIPTION |
|-------------------|-------------|-------------------|-------------|
| index | Attribute that holds the index (row labels) of the DataFrame | insert() | Method inserts a column into a DataFrame at a given position |
| add() | Method returns addition of dataframe and other, element-wise (binary operator add) | sub() | Method returns subtraction of dataframe and other, element-wise (binary operator sub) |
| mul() | Method returns multiplication of dataframe and other, element-wise (binary operator mul) | div() | Method returns floating division of dataframe and other, element-wise (binary operator truediv) |
| unique() | Method returns the unique values in a Series | nunique() | Method returns the number of unique values (per column when called on a DataFrame) |
| value_counts() | Method counts the number of times each unique value occurs within a Series | columns | Attribute that holds the column labels of the DataFrame |
| axes | Attribute that holds a list representing the axes of the DataFrame | isnull() | Method returns a Boolean mask that marks null values, useful for extracting rows with null values |
| notnull() | Method returns a Boolean mask that marks non-null values, useful for extracting rows with non-null values | between() | Method is used on a Series to check which values lie between the first and second argument, useful for extracting rows where a column value falls in a predefined range |
| isin() | Method checks whether each value exists in a predefined collection, useful for extracting matching rows from a DataFrame | dtypes | Attribute that holds a Series with the data type of each column. The result's index is the original DataFrame's columns |
| astype() | Method converts the data type of a Series or DataFrame | values | Attribute that holds a NumPy representation of the DataFrame, i.e. only the values are returned and the axes labels are removed |
| sort_values() | Method sorts a DataFrame in ascending or descending order of the passed column | sort_index() | Method sorts a Series or DataFrame by its index instead of its values |
| head() | Method returns a specified number of rows from the beginning as a new object | tail() | Method returns a specified number of rows from the end as a new object |
| get() | Method is called on a Series to extract a value by key. This is an alternative to bracket syntax | factorize() | Method helps to get the numeric representation of an array by identifying distinct values |
| map() | Method maps the values of a Series to other values through a dictionary, Series, or function | apply() | Method is passed a Python function as an argument and applies it to every Series value. This is helpful for custom operations that are not included in pandas or NumPy |
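
Several entries in the table are not used in the cells of this section. The sketch below tries a few of them on a small invented ``people`` table. The names, ages, and port codes are hypothetical, so the results do not describe the Titanic data:

```python
import pandas as pd

# Hypothetical data, invented for illustration
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "age": [22, 35, 71, 35],
    "port": ["S", "C", "Q", "S"],
})

# between(): rows whose age lies in [30, 40], both ends included
in_range = people[people["age"].between(30, 40)]

# isin(): rows whose port is one of the listed values
from_s_or_q = people[people["port"].isin(["S", "Q"])]

# astype(): convert a column to another data type
ages_as_float = people["age"].astype(float)

# map(): translate values through a dictionary
port_names = people["port"].map(
    {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}
)

# apply(): run a custom function on every value
age_in_decades = people["age"].apply(lambda years: years // 10)

# factorize(): integer codes plus the distinct values they stand for
codes, uniques = pd.factorize(people["port"])

print(in_range)
print(codes, list(uniques))
```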

In [11]:
# Continue the example with titanic dataset:
df.shape

(891, 12)

In [12]:
df.index

RangeIndex(start=0, stop=891, step=1)

In [25]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [23]:
df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [22]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [26]:
df['Cabin'].nunique()

147

In [27]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [34]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [40]:
df[df['Age']>70]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
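
The filter above uses a single condition. Conditions can also be combined, as in the following sketch. It assumes a tiny invented ``passengers`` table that borrows the Titanic column names, not the real file:

```python
import pandas as pd

# Hypothetical rows using the Titanic column names
passengers = pd.DataFrame({
    "Age": [71.0, 70.5, 80.0, 22.0],
    "Pclass": [1, 3, 1, 3],
    "Survived": [0, 0, 1, 0],
})

# Combine Boolean masks with & (and) or | (or);
# each condition needs its own parentheses
mask = (passengers["Age"] > 70) & (passengers["Pclass"] == 1)
older_first_class = passengers[mask]

# query() expresses the same filter as a string
same_rows = passengers.query("Age > 70 and Pclass == 1")

print(older_first_class)
```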


In [41]:
df.sort_values('Age', ascending=False)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0000,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.7500,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


In [42]:
df = df.rename(columns={'Age': 'Ages', 'Embarked': 'City'})
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Ages,SibSp,Parch,Ticket,Fare,Cabin,City
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
