<img src="images/pandas-intro.png">

# _Python_Pandas_Introduction

<img align="center" width="700" height="700"  src="images/pandas-apps.png"  >

> **Pandas is an open-source Python library built on top of NumPy that provides easy-to-use data structures and data analysis tools. The name is derived from "panel data", and the library was developed by Wes McKinney in 2008.**

> **Data scientists use pandas for a range of data science tasks, starting with downloading, opening, reading and writing files in formats such as CSV, Excel, JSON and HTML. They load the data set into a pandas data structure called a DataFrame.**

> **A Pandas Dataframe is a 2-dimensional labeled data structure (like SQL table) with heterogeneously typed columns, having both a row and a column index.**

> **After the data is loaded into a DataFrame, data scientists perform various data manipulation tasks, such as filtering and modifying data based on multiple conditions, and cutting, splitting, merging, sorting, scaling, pivoting and aggregating data.**

> **Data cleaning is done to enhance the data accuracy and integrity by identifying and removing null values, duplicates and outliers.**

> **Data wrangling transforms the data structurally into an appropriate format and makes it ready for machine learning engineers, so that they can apply appropriate machine learning models or algorithms to the data set for training, validation and testing.**

# Learning Agenda of this Notebook:
- What is Pandas and how is it used in AI?
- Key features of Pandas
- Data Types in Pandas
- What does Pandas deal with?

- Creating Series in Pandas
    - From Python List
    - From NumPy Arrays
    - From Python Dictionary
    - From a scalar value
    - Creating empty series object
- Attributes of a Pandas Series
- Arithmetic Operations on Series

- Dataframes in Pandas
    - Anatomy of a Dataframe
    - Creating Dataframe
        - An empty dataframe
        - Two-Dimensional NumPy Array
        - Dictionary of Python Lists
        - Dictionary of Pandas Series
    - Attributes of a Dataframe
    - Bonus
- Data Handling with Pandas
  - Practice Exercise I
  - Practice Exercise II
- All Statistical functions in Pandas
- Input/Output Operations
- Aggregation & Grouping
  - Practice Exercise
- Merging, Joining and Concatenation
  - Practice Exercise
- How To Perform Data Visualization with Pandas
- Exercise I
- Exercise II
- Pandas's Assignment

### Data Structures in Pandas:
<img src="https://www.databricks.com/wp-content/uploads/2019/03/pandas1.png">

>-  **A Pandas Dataframe is a 2-dimensional labeled data structure (like SQL table) with heterogeneously typed columns, having both a row and a column index.**
>-  **In short, Pandas is a software library written for the Python programming language, and its job is `data analysis and manipulation`.**

## So, what is Pandas and how is it used in AI?

Artificial Intelligence is about running machine learning algorithms in products that we use every day. For any ML algorithm to be effective, the following prerequisite steps need to be done.
- `Data Collection` – Conducting opinion surveys, scraping the internet, etc.
- `Data Handling` – Viewing data as a table and performing cleaning activities such as checking spellings, removing blanks and wrong cases, removing invalid values, etc.
- `Data Visualization` – Plotting appealing graphs, so anyone who looks at the data can see what story it tells.

`Pandas`, short for `Panel Data` (a panel is a 3D container of data), is a Python library that contains built-in functions to clean, transform, manipulate, visualize and analyze data.

## Key Features of Pandas
<img src="images/Python-Pandas-Features.webp" height=600px width=600px>


- It has a fast and efficient DataFrame object with default and customized indexing.
- It supports reshaping and pivoting of data sets.
- It can group data for aggregations and transformations.
- It supports data alignment and integrated handling of missing data.
- It provides time series functionality.
- It processes data sets in different forms, such as matrix data, heterogeneous tabular data and time series.
- It handles many operations on data sets, such as subsetting, slicing, filtering, group-by, re-ordering and re-shaping.
- It integrates with other libraries such as SciPy and scikit-learn.
- It provides fast performance, and performance-critical code can be sped up even more with Cython.

## Data Types
A data type is used by a programming language to understand how to store and manipulate data.
- `int` : Integer number, eg: 10, 12
- `float` : Floating point number, eg: 100.2, 3.1415
- `bool` : True/False value
- `object` : Text, non-numeric, or a combination of text and non-numeric values, eg: Apple
- `DateTime` : Date and time values
- `category` : A finite list of values
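
As a rough illustration of the list above, the sketch below builds one small Series per data type and prints its `dtype`. The variable names and sample values are made up for this example. Pandas reports sized names such as `int64` and `float64`, and the dtype shown for text can differ between pandas versions.

```python
import pandas as pd

# One small Series per data type listed above (sample values are illustrative).
ints = pd.Series([10, 12])
floats = pd.Series([100.2, 3.1415])
bools = pd.Series([True, False])
text = pd.Series(["Apple", "Mango"])
dates = pd.to_datetime(pd.Series(["2023-01-15", "2023-02-20"]))
fruits = pd.Series(["Apple", "Mango", "Apple"], dtype="category")

for label, ser in [("int", ints), ("float", floats), ("bool", bools),
                   ("text", text), ("datetime", dates), ("category", fruits)]:
    print(f"{label:>8}: {ser.dtype}")
```

A `category` column stores each distinct value once (see `fruits.cat.categories`), which is why it suits columns with a small, fixed set of values.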

## What does Pandas deal with?
There are two major categories of data that you can come across while doing data analysis.
- One dimensional data
- Two-dimensional data

These data can be of any data type: characters, numbers or even arbitrary objects.

> **A Series in Pandas is one-dimensional data, and a data frame is 2-dimensional data. A series holds a single data type, whereas a data frame can contain a different data type in each column.**

![](images/dataframe.webp)

**In the example shown above, `Name` is a `series` of the datatype `Object`, and it is treated as a character array. `Age` is another series, of the type `Integer`. `Marks` is the third series, again of the type `Integer`. The individual Series are one-dimensional and hold only one data type. The `dataframe` as a whole, however, is two-dimensional and `heterogeneous` in nature.**

# Creating Series & data frames in python

## Creating a simple Series


<img align="right" width="500" height="600"  src="images/series-anatomy.png"  >

> **A Series is a one-dimensional array capable of holding a sequence of values of any data type (integers, floating point numbers, strings, Python objects etc) which by default have numeric data labels starting from zero. You can imagine a Pandas Series as a column in a spreadsheet or a Pandas Dataframe object.**
- To create a Series object you can use the `pd.Series()` constructor

**```pd.Series(data, index, dtype, name)```**
- Where,
   - `data`: can be a Python list, Python dictionary, NumPy array, or a scalar value.
   - `index`: If you do not pass the index argument, it defaults to a range equivalent to `np.arange(n)`. Index labels must be hashable (for example numbers or strings), and the index must have the same length as `data`. Non-unique index values are allowed. The index is used for three purposes:
       - Identification.
       - Selection.
       - Alignment.
   - `dtype`: Optionally, you can assign any valid NumPy datatype to the series object (listed in `np.sctypes` in NumPy versions before 2.0). If not specified, it is inferred from `data`.
   - `name`: Optionally, you can assign a name to a series, which becomes an attribute of the series object. It also becomes the column name if that series object is later used to create a dataframe.

In [20]:
import pandas as pd
import numpy as np

In [9]:
pd.__version__, pd.__path__

('1.5.3', ['/home/dell/.local/lib/python3.8/site-packages/pandas'])

### a. Creating a Series from Python List

In [10]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', '','Dua']  # note the empty string

# When index is not provided, it creates an index for the data starting from zero and with a step size of one.
s = pd.Series(data=list1)

In [12]:
print(s)

print()
print()

print(type(s))

0    Ehtisham
1         Ali
2      Ayesha
3            
4         Dua
dtype: object


<class 'pandas.core.series.Series'>


> Observe that output is shown in two columns - the `index` is on the left and the `data value` is on the right. If we do not explicitly specify an index for the data values while creating a series, then by default indices range from `0` through `N – 1`. Here N is the number of data elements.

**You can explicitly specify the index for a Series object, which can be either int or string type, and must be of the same size as the values in the series. Otherwise, it will raise a ValueError**

In [13]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua']
indices = ['01', '02', '', '02']   # non-unique index values are allowed and you can have empty string as index

s = pd.Series(data=list1, index=indices)

In [14]:
print(s)
print(type(s))

01    Ehtisham
02         Ali
        Ayesha
02         Dua
dtype: object
<class 'pandas.core.series.Series'>


In [15]:
s['02']

02    Ali
02    Dua
dtype: object

> Also note that non-unique indices are allowed.

In [17]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua']
indices = [2.1, 2.2, 2.3, 2.4,]   

s = pd.Series(data=list1, index=indices)

In [18]:
print(s)
print()
print()

print(type(s))

2.1    Ehtisham
2.2         Ali
2.3      Ayesha
2.4         Dua
dtype: object


<class 'pandas.core.series.Series'>


**You can create a series with NaN values, using `np.nan`, which is IEEE 754 floating-point representation of Not a Number. NaN values can act as a placeholder for any missing numerical values in the array.**

In [21]:
list1 = [1, 2.7, np.nan, 54]


s = pd.Series(data=list1)

In [22]:
print(s)
print(type(s))

0     1.0
1     2.7
2     NaN
3    54.0
dtype: float64
<class 'pandas.core.series.Series'>


> Also note the `dtype` of the series object is inferred from the data as `float64`.

**You can use the `dtype` argument to specify a datatype to the series object.**


In [25]:
list1 = [27, 33, 19]

s = pd.Series(data=list1, dtype=np.uint8)

In [26]:
print(s)
print(type(s))

0    27
1    33
2    19
dtype: uint8
<class 'pandas.core.series.Series'>


**Optionally, you can assign a name to a series, which becomes attribute of the series object. Moreover, it becomes the column name, if that series object is used to create a dataframe later.**

In [27]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua']

indices = ['01', '02', '03', '04']

s = pd.Series(data=list1, index=indices, name='myseries1') 


In [28]:
print(s)
print(type(s))

01    Ehtisham
02         Ali
03      Ayesha
04         Dua
Name: myseries1, dtype: object
<class 'pandas.core.series.Series'>


### b. Creating a Series from NumPy Array

In [31]:
s = pd.Series(data = np.arange(4))

In [None]:
print(s)
print(type(s))

### c. Creating a Series from Python Dictionary

In [35]:
my_dict = {
    'name':"Ehtisham", 
    'gender':"Male", 
    'Role':"Student", 
    'subject':"Cloud Computing"
}
s = pd.Series(data=my_dict)

In [36]:
print(s)
print()
print(type(s))

name              Ehtisham
gender                Male
Role               Student
subject    Cloud Computing
dtype: object

<class 'pandas.core.series.Series'>


> **When you create a series from dictionary, it will automatically take the keys as index and the value as data**

### d. Creating a Series from Scalar value

In [39]:
s = pd.Series(data=25)


print(s)
print(type(s))

0    25
dtype: int64
<class 'pandas.core.series.Series'>


### e. Creating an Empty Series

In [42]:
# Pass at least `dtype`; pandas 1.x warns when it is omitted for an empty Series
s = pd.Series(dtype='float64')

print(s)
print(type(s))

Series([], dtype: float64)
<class 'pandas.core.series.Series'>




## Attributes of Pandas  Series

- We can access certain properties of a series, called attributes, by using dot `.` notation on the series object
- Most attributes of a pandas Series are also available on a pandas DataFrame.

In [44]:
my_dict = {
        0:"Ehtisham Sadiq", 1:np.nan, 2:"Ali Sadiq", 3:"Ayesha Sadiq", 
        4:"Dua Sadiq", 5:"Khubaib Sadiq", 6:"Adeen Sadiq"
    }

s = pd.Series(my_dict, name="myseries1")
s

0    Ehtisham Sadiq
1               NaN
2         Ali Sadiq
3      Ayesha Sadiq
4         Dua Sadiq
5     Khubaib Sadiq
6       Adeen Sadiq
Name: myseries1, dtype: object

In [45]:
# `name` attribute of a series object return the name of the series object
s.name

'myseries1'

In [46]:
# `index` attribute of a series object returns the Index object holding the labels, along with its datatype
s.index

Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')

In [47]:
# `values` attribute of a series object returns the values as a NumPy array, along with its datatype
s.values

array(['Ehtisham Sadiq', nan, 'Ali Sadiq', 'Ayesha Sadiq', 'Dua Sadiq',
       'Khubaib Sadiq', 'Adeen Sadiq'], dtype=object)

In [48]:
# `dtype` attribute of a series object return the type of underlying data
s.dtype

dtype('O')

In [49]:
# `shape` attribute of a series object return a tuple of shape of underlying data
s.shape

(7,)

In [50]:
# `nbytes` attribute of a series object returns the number of bytes of underlying data (each object-dtype element is an 8-byte pointer on a 64-bit platform)
s.nbytes

56

In [51]:
# `size` attribute of a series object return number of elements in the underlying data
s.size

7

In [52]:
# `ndim` attribute of a series object return number of dimensions of underlying data
s.ndim

1

In [53]:
# `hasnans` attribute of a series object return true if there are NaN values in the data
s.hasnans

True

<img align="right" width="500" height="500"  src="images/series-anatomy.png"  >

## Understanding Index in a Series
- Every series object has an index associated with every item. 
- The Pandas series object supports both integer-based (default) and label/string-based indexing and provides a host of methods for performing operations involving the index.
<br><br>

- Index in series object is used for three purposes:
    - Identification
    - Selection/Filtering/Subsetting
    - Alignment <br><br>
    
- There are three ways to access elements of a series:
    - Using the `s[]` operator and specifying the index label (an integer for the default index, or a string label)
    - Using the `s.loc[]` indexer and specifying the index label (label-based)
    - Using the `s.iloc[]` indexer and specifying the position (an integer value from 0 to length-1). It also supports negative indexing: the last element can be accessed with a position of -1
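
The three access styles listed above can be sketched as follows. The labels and values are illustrative only. Note the difference in slicing: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position.

```python
import pandas as pd

# A small Series with string labels (values and labels are illustrative).
s = pd.Series([91.5, 93.0, 80.0, 65.0], index=["a", "b", "c", "d"])

by_label = s["b"]          # [] with a label
by_loc = s.loc["c"]        # .loc is label-based
by_position = s.iloc[0]    # .iloc is position-based
last_item = s.iloc[-1]     # negative positions count from the end

label_slice = s.loc["a":"c"]   # label slices include the end label
position_slice = s.iloc[0:2]   # position slices exclude the end position

print(by_label, by_loc, by_position, last_item)
print(label_slice)
print(position_slice)
```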

### Working

In [56]:
# Here write your code 

## Arithmetic Operations on Series

**Example 1:** Adding two series object with same integer indices

In [59]:
list1 = [1,3,5,7,9] 
list2 = [2,4,6,8,10]

s1 = pd.Series(data=list1)
s2 = pd.Series(data=list2)

In [60]:
print(s1)
print(s1.index)

0    1
1    3
2    5
3    7
4    9
dtype: int64
RangeIndex(start=0, stop=5, step=1)


In [61]:
print(s2)
print(s2.index)

0     2
1     4
2     6
3     8
4    10
dtype: int64
RangeIndex(start=0, stop=5, step=1)


In [62]:
s3 = s1 + s2

print(s3)
print(s3.index)

0     3
1     7
2    11
3    15
4    19
dtype: int64
RangeIndex(start=0, stop=5, step=1)


**Example 2:** Adding two series object having different integer indices.

In [64]:
# First Series
list1 = [6,9,7,5]
index1 = [0,1,2,3]
s1 = pd.Series(data=list1, index=index1);

# Second Series
list2 = [8,6,2,1]
index2 = [0,2,3,5]
s2 = pd.Series(data=list2, index=index2);

In [65]:
print(s1)
print(s1.index)

0    6
1    9
2    7
3    5
dtype: int64
Int64Index([0, 1, 2, 3], dtype='int64')


In [66]:
print(s2)
print(s2.index)

0    8
2    6
3    2
5    1
dtype: int64
Int64Index([0, 2, 3, 5], dtype='int64')


In [67]:
s3 = s1 + s2
print(s3)
print(s3.index)

0    14.0
1     NaN
2    13.0
3     7.0
5     NaN
dtype: float64
Int64Index([0, 1, 2, 3, 5], dtype='int64')


**Problem:** While performing mathematical operations on series having mismatched indices, all missing values are filled in with NaN by default.

**Solution:** To handle this problem, instead of using the operators (`+, -, *, /`), an explicit call to `s.add()`, `s.sub()`, `s.mul()` and `s.div()` is preferred. These methods accept a `fill_value` argument, which stands in for a value that is missing in either series, so the result holds a concrete number in place of NaN.

In [73]:
s1.add(s2, fill_value=0) # Compare it with above result

0    14.0
1     9.0
2    13.0
3     7.0
5     1.0
dtype: float64

**My dear fellows, please make time to practice following topics related to Series:**

- Boolean/Fancy Indexing and Slicing
- Use of `reset_index()` method for completely resetting the index
- Use of other manipulation methods like 
    - `s.pop(index)` is passed an index label; it returns the data item at that label and removes it from the series
    - `s.drop(indexes)` is passed one label or a list of labels and returns a new series without those items. The series itself remains unchanged unless the `inplace=True` argument is passed
    - `s1.append(s2, ignore_index=False, verify_integrity=False)` concatenates two series and returns the concatenated series, leaving the originals unchanged. It was deprecated in pandas 1.4 and removed in pandas 2.0; use `pd.concat([s1, s2])` instead
    - `s1.update(s2)` is used to modify the series `s1` in place using the values from the passed series

> **We will discuss these while studying Pandas Dataframe object In Shaa Allah**
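
As a brief preview of the manipulation methods listed above, here is a minimal sketch on two made-up series. It uses `pd.concat()` for joining, since `Series.append()` is not available in pandas 2.0 and later.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([99, 40], index=["b", "d"])

# drop() returns a new Series; s1 itself is unchanged.
dropped = s1.drop(["a"])

# pd.concat() joins Series end to end and keeps the original labels.
combined = pd.concat([s1, s2])

# reset_index(drop=True) discards the labels and renumbers from 0.
renumbered = combined.reset_index(drop=True)

# pop() removes the item with the given label in place and returns its value.
working = s1.copy()
popped = working.pop("c")

# update() overwrites values in place where the labels match ("b" here);
# labels that exist only in s2 ("d") are ignored.
updated = s1.copy()
updated.update(s2)

print(dropped, combined, renumbered, popped, working, updated, sep="\n\n")
```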

### Pandas Series vs NumPy 1-D Arrays

>- In a series object we can define our own labeled index to access elements. The labels can be numbers or strings. NumPy arrays are accessed by integer position only.
>- In a series object the index labels can be in any order, including descending order. In NumPy arrays, positions always start at zero for the first element and are fixed.
>- Arithmetic on series aligns values by index label, so misaligned indices may generate NaN (missing) values. NumPy arithmetic works by position and uses broadcasting instead of label alignment; if the array shapes are incompatible, the operation fails with an error.
>- A series requires more memory than a NumPy array holding the same values, because it also stores an index.
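
The alignment difference described above can be seen in a short sketch. The labels and values are illustrative: the two series share only the labels `b` and `c`, while the NumPy arrays are simply added position by position.

```python
import numpy as np
import pandas as pd

# Series: values are matched by label before adding.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
series_sum = s1 + s2   # "a" and "d" have no partner, so they become NaN

# NumPy arrays: values are matched by position only.
a1 = np.array([1, 2, 3])
a2 = np.array([10, 20, 30])
array_sum = a1 + a2

# Arrays whose shapes cannot be broadcast together raise a ValueError.
try:
    a1 + np.array([1, 2])
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

print(series_sum)
print(array_sum)
print(shapes_compatible)
```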
    
    

## Practice Questions:
- Write a Pandas program to convert a Pandas Series to a Python list and check its type.
- Write a Pandas program to add, subtract, multiply and divide two Pandas Series having the same indices.
- Write a Pandas program to compare the elements of two Pandas Series. (Hint : Series.eq / Series.equals)
- Write a Pandas program to change the data type of a given column or Series.
- Write a Pandas program to convert a given Series to an array. (Hint : series.to_numpy())
- Write a Pandas program to sort a given Series.
- Write a Pandas program to add some data to an existing Series. (Hint : pd.concat(); series.append() was removed in pandas 2.0)
- Write a Pandas program to compute the mean and standard deviation of the data of a given Series.
- Write a Pandas program to get the items of a given series not present in another given series. (Hint : series.isin())


<img align="right" width="500" height="500"  src="images/dataframe.webp">


## 1. Creating a Dataframe
<br><br>
>**A Pandas Dataframe is a two-dimensional labeled data structure (like SQL table) with heterogeneously typed columns, having both a row and a column index.**

<br><br><br><br>

**```pd.DataFrame(data=None, index=None, columns=None, dtype=None)```**
- Where,
   - `data`: It can be a 2-D NumPy Array, a Dictionary of Python Lists, or a Dictionary of Pandas Series (You can also create a dataframe from a file in CSV, Excel, JSON or HTML format, or from a database table).
   - `index`: These are the row indices. Will default to RangeIndex (0, 1, 2, ..., n-1), if the index argument is not passed and no indexing information is part of the input data.
   - `columns`: These are the column indices or labels. Will default to RangeIndex (0, 1, 2, ..., n-1), if the columns argument is not passed and no column labels are part of the input data.
   - `dtype`: Data type to force. Only a single dtype is allowed. If None, infer.
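
The kinds of `data` described above can be sketched as follows: an empty dataframe, one built from a 2-D NumPy array, and one built from a dictionary of Python lists. The values and the row labels `r1` to `r3` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# An empty DataFrame: no rows and no columns.
empty_df = pd.DataFrame()

# From a 2-D NumPy array; column labels are supplied explicitly.
arr = np.array([[21, 91.5], [18, 93.0], [16, 80.0]])
df_from_array = pd.DataFrame(data=arr, columns=["Age", "Marks"])

# From a dictionary of Python lists; keys become column labels.
df_from_lists = pd.DataFrame(
    {"Name": ["Ehtisham", "Ali", "Ayesha"], "Age": [21, 18, 16]},
    index=["r1", "r2", "r3"],   # hypothetical row labels
)

print(empty_df.shape)
print(df_from_array)
print(df_from_lists)
```

Because a NumPy array has a single dtype, both columns of `df_from_array` come out as floats, whereas the dictionary version keeps a separate dtype per column.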

#### Creating multiple series

In [None]:
name = ['Ehtisham', 'Ali', 'Ayesha', 'Dua']
marks = [91.5,93,80,65]
age = [21,18,16,6]

#Creating a Series by passing list variable to Series() function of pandas 
name_ser = pd.Series(name)
marks_ser = pd.Series(marks)
age_ser = pd.Series(age)

#Printing Series
print("Name Series : ", name_ser, sep="\n")
print("Marks Series : ", marks_ser, sep="\n")
print("Age Series : ", age_ser,sep="\n")

#### Creating Dataframe from multiple Series 

In [None]:
#Creating a Series by passing list variable to Series() function of pandas 
name_ser = pd.Series(name)
marks_ser = pd.Series(marks)
age_ser = pd.Series(age)

# Creating a Dictionary by passing series as values of dictionary
dic = {'Name':name_ser,
      'Marks':marks_ser,
      'Age':age_ser
      }

# Create dataframe by passing dictionary to pd.DataFrame function of pandas
df = pd.DataFrame(dic)
print("Printing of DataFrame .... ")
df

#### How to add new column to the dataframe

In [None]:
address = pd.Series(['Lahore','Okara','Okara','Okara'])
# Create a new column in the dataframe by assigning a Series built from a list
df['Address'] = address
print("Printing of DataFrame .... ")
df

## Data Handling with Pandas..

- **Data Reading** : Reading from a CSV or an Excel file. Pandas provides two functions, `read_csv()` and `read_excel()`, to read data from a CSV file and an Excel file respectively.

- **Viewing data** : Data in a data frame can be viewed in three ways
 >- using the data frame's name, which displays the whole frame, or only the first and last 5 rows when the frame is large
 >- using the `dataframe.head()` function, which returns the first 5 rows by default
 >- using the `dataframe.tail()` function, which returns the last 5 rows by default

- **Data Overview** : To see more details on the data frame, the `info()` function can be used. `info()` shows the datatype and non-null count of each column (series) in the data frame.

- The following functions are used to find the unique entries within a series/column in a data frame.
 >- `series.unique()` : returns the unique values (this method exists on a Series, so select a column first)
 >- `series.nunique()` : returns the count of unique values
 >- `series.value_counts()` : returns the frequency of each of the categories in the column

- In our example, the titanic dataset contains a column called `Survived` which tells if the particular passenger survived the tragedy. Since this value can only be either 0 or 1, we can convert the data type from integer to object.
 >- `dataframe.astype()` is the function which lets us do the conversion
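
The viewing and overview functions above can be tried on a tiny in-memory table before moving to a real file. The `passengers` frame below is a hypothetical stand-in with made-up values, not the actual titanic data.

```python
import pandas as pd

# A tiny stand-in for a passenger table (hypothetical values).
passengers = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Pclass": [1, 3, 3, 2, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})

first_two = passengers.head(2)    # first 2 rows
last_two = passengers.tail(2)     # last 2 rows
passengers.info()                 # prints column names, non-null counts, dtypes

classes = passengers["Pclass"].unique()             # distinct values, in order of appearance
n_classes = passengers["Pclass"].nunique()          # number of distinct values
class_counts = passengers["Pclass"].value_counts()  # frequency of each value

# Treat the 0/1 flag as a label rather than a number.
passengers["Survived"] = passengers["Survived"].astype("object")
print(passengers.dtypes)
```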

In [None]:
# !cat datasets/titanic3.csv

In [None]:
import os
# os.listdir('datasets/')

In [None]:
df = pd.read_csv('datasets/recent-grads.csv')

## Practice Questions Part 1: 
- Step 1. Import the necessary libraries.
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/u.user)
- Step 3. Assign it to a variable called users and use the `user_id` as index
- Step 4. See the first 25 entries
- Step 5. See the last 10 entries
- Step 6. What is the number of observations in the dataset?
- Step 7. What is the number of columns in the dataset?
- Step 8. Print the name of all the columns.
- Step 9. How is the dataset indexed?
- Step 10. What is the data type of each column?
- Step 11. Print only the occupation column
- Step 12. How many different occupations are in this dataset?
- Step 13. What is the most frequent occupation?
- Step 14. Summarize the DataFrame.
- Step 15. Summarize all the columns
- Step 16. Summarize only the occupation column
- Step 17. What is the mean age of users?
- Step 18. What is the age with least occurrence?

In [None]:
import pandas as pd
# data reading
url = "https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/u.user"
users = pd.read_csv(url, delimiter="|")
users.head()

In [None]:
# Task no 03
# First Method
users = pd.read_csv(url, delimiter="|", index_col='user_id')
users.head()


# Second method
# users.set_index('user_id')

In [None]:
# Task no 04
users.head(25)

In [None]:
# Task no 05
users.tail(10)

In [None]:
# Task no 06
users.shape
# or 
users.info()

In [None]:
# Task no 07 & 08
print("users.shape : ",users.shape)
users.columns

In [None]:
# Task no 09
users.index

In [None]:
# Task no 10
users.dtypes
 
#     or 
# users.info()

In [None]:
# Task no 11
users['occupation']

In [None]:
# Task no 12
# First method
users['occupation'].unique()

In [None]:
# Second method
users['occupation'].nunique()

In [None]:
# Third method
users['occupation'].value_counts()

In [None]:
# Task no 13
# idxmax() returns the label (occupation) with the highest count
users['occupation'].value_counts().idxmax()

In [None]:
# Task no 14
users.describe()

In [None]:
# Task no 15
users.describe(include='all')

In [None]:
# Task no 16
users['occupation'].describe()

In [None]:
# Task no 17
users['age'].mean()

In [None]:
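# Task no 18
# value_counts() sorts by frequency (descending), so the rarest ages are at the end
users['age'].value_counts().tail()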

In [None]:
# import pandas as pd
# pd.Series([])
# pd.DataFrame(dict)
# shape -> (number of rows, number of columns)
# info() -> index range, column names, non-null counts and datatype of each column
# dtypes -> returns the datatype of every column
# describe() -> descriptive summary of the dataframe (count, mean, std, min, max, quartiles)
# head() -> first 5 rows by default
# tail() -> last 5 rows by default
# unique()  -> all unique values in the given column (rarely useful for continuous data)
# nunique() -> returns the count of unique values
# value_counts() -> returns unique values along with their frequency

## Practice Questions Part 2:
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/Euro_2012_stats_TEAM.csv)
- Step 3. Assign it to a variable called `euro12`.
- Step 4. Select only the Goals column.
- Step 5. How many teams participated in Euro 2012? (value_counts/shape)
- Step 6. What is the number of columns in the dataset? (shape/info)
- Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline
- Step 8. Sort the teams by Red Cards, then by Yellow Cards (Hint: sort_values)
- Step 9. Calculate the mean Yellow Cards given per Team (Hint: round())
- Step 10. Filter teams that scored more than 6 goals
- Step 11. Select the teams that start with G (Hint : str.startswith('G'))
- Step 12. Select the first 7 columns and all the rows (Hint: iloc[])
- Step 13. Select all columns except the last 3. (Hint: iloc[])
- Step 14. Show only the Shooting Accuracy of England, Italy and Russia

In [None]:
import pandas as pd
url = "https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/Euro_2012_stats_TEAM.csv"
euro12 = pd.read_csv(url)
euro12.head(2)

In [None]:
euro12['Goals']


In [None]:
euro12['Team'].shape

In [None]:
# euro12.shape
# or 
euro12.info()

In [None]:
discipline = euro12[['Team','Yellow Cards','Red Cards']]
discipline.head()

In [None]:
# discipline = discipline.sort_values('Red Cards')
# discipline = discipline.sort_values('Yellow Cards')


#  Or

discipline.sort_values(by =['Red Cards','Yellow Cards'])


In [None]:
# Select the numeric column first; the Team column holds text
round(discipline['Yellow Cards'].mean())

In [None]:
euro12.head()

In [None]:
euro12['Goals']>6

In [None]:
euro12[euro12['Goals'] > 6]

In [None]:
euro12['Team'].str.startswith('G')

In [None]:
euro12[euro12['Team'].str.startswith('G')]

In [None]:
euro12.iloc[:,:7]

In [None]:
# All columns except the last 3
euro12.iloc[:, :-3]

In [None]:
# Rows for the three teams, and only the Team and Shooting Accuracy columns
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]

## All statistical functions
- `count()` : Returns the number of non-null values
- `sum()` : Returns the sum of all values
- `mean()` : Returns the average of all values
- `median()` : Returns the median of all values
- `mode()` : Returns the most frequent value(s)
- `std()` : Returns the standard deviation
- `min()` : Returns the minimum of all values
- `max()` : Returns the maximum of all values
- `abs()` : Returns the absolute value of each element
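
The functions above can be checked on a small series where the answers are easy to verify by hand. The `marks` values below are made up and include one missing entry to show that it is skipped.

```python
import pandas as pd

marks = pd.Series([4, -2, 7, 7, None])   # illustrative values with one missing entry

print("count :", marks.count())           # non-null values only
print("sum   :", marks.sum())
print("mean  :", marks.mean())
print("median:", marks.median())
print("mode  :", marks.mode().tolist())   # mode() returns a Series, since ties are possible
print("std   :", marks.std())             # sample standard deviation (ddof=1)
print("min   :", marks.min())
print("max   :", marks.max())
print("abs   :", marks.abs().tolist())    # element-wise, not an aggregate
```

Unlike the other functions, `abs()` does not reduce the series to a single number; it returns a series of the same length.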

In [None]:
print("Total number of elements in each column of dataframe ")
df.count()

In [None]:
df['Age'].count()

In [None]:
df.sum(numeric_only=True)

## Input and Output

- Often, you won’t be creating data but will be having it in some form, and you would want to import it to run your analysis on it. Fortunately, Pandas allows you to do this. Not only does it help in importing data, but you can also save your data in your desired format using Pandas.
- The below table shows the formats supported by Pandas, the function to read files using Pandas, and the function to write files.

|Input type      |Reader      |Writer     |
|----------------|------------|-----------|
|CSV             |read_csv    |to_csv     |
|JSON            |read_json   |to_json    |
|HTML            |read_html   |to_html    |
|Excel           |read_excel  |to_excel   |
|SAS             |read_sas    |–          |
|Python Pickle   |read_pickle |to_pickle  |
|SQL             |read_sql    |to_sql     |
|Google Big Query|read_gbq    |to_gbq     |
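
A reader/writer pair from the table can be tried without touching the file system. The sketch below writes a small, made-up dataframe to CSV text in an in-memory buffer and reads it back, under the assumption that a `StringIO` buffer is an acceptable stand-in for a file path.

```python
import io
import pandas as pd

df_out = pd.DataFrame({
    "Name": ["Hulk", "Thor"],
    "Rating": [84, 93],
})

# Write to an in-memory buffer instead of a file on disk.
# index=False leaves the row index out of the CSV.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a new DataFrame.
df_in = pd.read_csv(io.StringIO(csv_text))
print(df_in.equals(df_out))
```

Without `index=False`, `to_csv()` also writes the row index as an unnamed first column, which then shows up as an extra column when the file is read back.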

In [None]:
#Read input file
df = pd.read_csv('datasets/psl.csv')
df.head()

In [None]:
# Save a dataframe to CSV File
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther'],
        'Rating':[100, 80, 84, 93, 90],
        'Place':['USA','USA','USA','Asgard','Wakanda']}
# Create dataframe from above dictionary
df = pd.DataFrame(data)
df
df.to_csv("datasets/avengers1.csv")

In [None]:
!cat datasets/avengers1.csv

## Aggregation
- An aggregation function can be applied to one or more columns. You can either apply the same aggregate function across several columns or a different aggregate function to each column.
- Syntax : 
 >- `DataFrame.aggregate(func=None, axis=0, *args, **kwargs)`
 
<img src="images/pandas-agg-func.png" height=400px width=600px>

In [None]:
data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
gapminder.head(3)



In [None]:
gapminder_data = gapminder[['continent','pop']]
gapminder_data.head()
# gapminder.head()

In [None]:
# Using Aggregate Functions on Series
mean  = gapminder_data['pop'].aggregate('mean')
print("Mean of population : ", mean)

Min  = gapminder_data['pop'].aggregate('min')
print("Minimum value of population : ", Min)

Max  = gapminder_data['pop'].aggregate('max')
print("Maximum value of population : ", Max)

Std  = gapminder_data['pop'].aggregate('std')
print("Std of population : ", Std)

In [None]:
# Using multiple Aggregate Functions on a single column (Series)
gapminder_data['pop'].aggregate(['sum','min','max'])

In [None]:
# Using multiple Aggregate Functions on Multiple columns of Dataframe
gapminder[['pop','lifeExp']].aggregate(['sum','min','max','std','mean','var'])

In [None]:
# We can also perform above task by using below code
gapminder.aggregate({'pop':['sum','min','max'],
                    'lifeExp':['sum','min','max']})

In [None]:
# df.describe()  gives overall descriptive view of our dataset
gapminder.describe()

## Groupby
- Pandas groupby function is used to split the DataFrame into groups based on some criteria. 
- Similar to the `SQL GROUP BY` clause pandas `DataFrame.groupby()` function is used to collect the identical data into groups and perform aggregate functions on the grouped data. Group by operation involves splitting the data, applying some functions, and finally aggregating the results.

<img src="images/pandas-groupby-standard-dev.png.webp" height=500px width=500px>
<img src="images/groupby-example.png" height=700px width=700px>

### Syntax of Pandas DataFrame.groupby()

       
       `DataFrame.groupby(by=None, axis=0, level=None, as_index=True,     
       sort=True, group_keys=True, squeeze=<no_default>,      
       observed=False, dropna=True)`
       
       
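- `by` – Column name or list of column names to group by
- `axis` – Defaults to 0. It takes 0 or ‘index’, 1 or ‘columns’
- `level` – Used with MultiIndex.
- `as_index` – Defaults to True, which makes the group keys the index of the result. Use False for SQL-style grouped output, where the keys stay as columns.
- `sort` – Defaults to True. Specifies whether to sort the group keys
- `group_keys` – Whether to add the group keys to the index when calling `apply()`
- `squeeze` – Deprecated in pandas 1.x and removed in pandas 2.0
- `observed` – This only applies if any of the groupers are Categoricals.
- `dropna` – Defaults to True, which drops groups whose key is None/NaN. Use False to keep them.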
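
To make a few of these parameters concrete, the sketch below compares the default behaviour with `as_index=False`, `sort=False` and `dropna=False`. The `courses` frame is a made-up, cut-down example with one missing course name.

```python
import numpy as np
import pandas as pd

courses = pd.DataFrame({
    "Courses": ["Spark", "Hadoop", "Spark", np.nan],
    "Fee": [22000, 23000, 25000, 1500],
})

# Default behaviour: group keys become the index, sorted, NaN keys dropped.
default_sum = courses.groupby("Courses")["Fee"].sum()

# as_index=False keeps the key as an ordinary column;
# sort=False keeps groups in order of first appearance.
flat_sum = courses.groupby("Courses", as_index=False, sort=False)["Fee"].sum()

# dropna=False keeps the rows whose key is missing as their own group.
with_nan = courses.groupby("Courses", dropna=False)["Fee"].sum()

print(default_sum, flat_sum, with_nan, sep="\n\n")
```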

In [None]:
import numpy as np

In [None]:
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python",np.nan],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,1500],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
          })
df = pd.DataFrame(technologies)
df

#### Use groupby() to compute the sum of Fee and Discount of each course

In [None]:
df.groupby(['Courses']).sum()

In [None]:
# method 2
df.groupby(['Courses'])[['Fee','Discount']].sum()

In [None]:
# similarly
df.groupby(['Courses']).aggregate('sum')

#### pandas groupby() on Two or More Columns like Courses and Duration

In [None]:
df.groupby(['Courses','Duration']).mean()

In [None]:
# df

#### Add Index to the grouped data
- By default, the `groupby()` result uses the group keys as its index. You can turn them back into regular columns, with a default row index, using the `DataFrame.reset_index()` method.

In [None]:
df.groupby(['Courses','Duration']).mean().reset_index()

#### Remove sorting on grouped results by using `sort` parameter of df.groupby()

In [None]:
df2=df.groupby(by=['Courses'], sort=False).sum()
df2

#### Apply More Aggregations
- You can also compute several aggregations at the same time in pandas by passing a list of aggregation functions to `aggregate()`.

#### Compute minimum and maximum fee of each course

In [None]:
df.groupby('Courses')['Fee'].aggregate(['min','max'])

In [None]:
# Groupby multiple columns & multiple aggregations
df.groupby('Courses').aggregate({'Duration':'count',
                                'Fee':['min','max']})
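
The list and dictionary forms above produce MultiIndex column labels. One alternative, sketched here on a made-up `toy` DataFrame, is named aggregation, where each keyword passed to `agg()` gives the output column name together with the source column and function. The output names `min_fee`, `max_fee`, and `n_rows` are illustrative choices.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "course": ["Spark", "Hadoop", "Spark", "Hadoop"],
    "fee": [22000, 23000, 25000, 25000],
})

# Named aggregation: keyword = (column, aggregation function)
summary = toy.groupby("course").agg(
    min_fee=("fee", "min"),
    max_fee=("fee", "max"),
    n_rows=("fee", "count"),
)
print(summary)
```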

## Practice Questions

### Regiment
- A regiment is a military unit. Its role and size vary markedly, depending on the country, service and/or specialisation.



#### Step 1. Import the necessary libraries


In [None]:
import pandas as pd

#### Step 2. Create the DataFrame with the following values and Assign it to a variable called regiment.

In [None]:
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
regiment = pd.DataFrame(raw_data)
regiment

#### Step 3. What is the mean `preTestScore` from the regiment `Nighthawks`(Nightbird/Night owl)?


In [None]:
# First Method
regiment[regiment['regiment'] == 'Nighthawks'].describe()

In [None]:
regiment[regiment['regiment'] == 'Nighthawks']['preTestScore']

In [None]:
# Second Method
regiment[regiment['regiment'] == 'Nighthawks']['preTestScore'].mean()

In [None]:
# regiment[regiment['regiment'] == 'Nighthawks']

# 3rd Method
# numeric_only=True skips the text columns, which mean() cannot average
regiment.groupby('regiment').mean(numeric_only=True)

In [None]:
regiment.groupby('regiment').get_group('Nighthawks')['preTestScore'].mean()

#### Step 4. Present/show general statistics by `company` of regiment.

In [None]:
# regiment.groupby('company').describe()

regiment.groupby('company').describe()

#### Step 5. What is the mean of each company's preTestScore?

In [None]:
regiment.groupby('company')['preTestScore'].mean()

In [None]:
# regiment.groupby('company')['preTestScore'].mean()

# OR

# numeric_only=True skips the text columns, which mean() cannot average
regiment.groupby('company').mean(numeric_only=True)

#### Step 6. Presents/shows the `mean` preTestScores grouped by regiment and company.

In [None]:
regiment.groupby(['regiment','company'])['preTestScore'].mean()

In [None]:
# regiment.groupby(['regiment','company'])['preTestScore'].mean()
# OR
regiment.groupby(['regiment', 'company']).preTestScore.mean().unstack()

#### Step 7. Present/show the `mean` preTestScores grouped by regiment and company, using the `reset_index()` method

In [None]:
regiment.groupby(['regiment', 'company'])['preTestScore'].mean().reset_index()

#### Step 8. Group the entire dataframe by regiment and company , also perform `sum` aggregate function.

In [None]:
regiment.groupby(['regiment','company']).sum()

#### Step 9. What is the number of observations in each regiment and company?

In [None]:
regiment.groupby(['regiment','company']).size()
# OR 
regiment.groupby(['regiment','company']).count()

#### Step 10. Iterate over a group and print the name and the whole data from the regiment

In [None]:
for name, data in regiment.groupby('regiment'):
    print("Group Name :", name)
    print("Group Data : ", data,sep="\n")

In [None]:
# # Group the dataframe by regiment, and for each regiment,
# for name, group in regiment.groupby('regiment'):
#     # print the name of the regiment
#     print('Name : ',name)
# #     print data of that regiment
#     print(group)

## Merging, Joining and Concatenation
Before I start with the Pandas join and merge functions, let me introduce four types of joins: inner join, left join, right join, and outer join.
<img src="images/Untitled.png" height=500px width=500px align="right"> 

- **Full outer join**: Combines results from both DataFrames. The result has all rows from both DataFrames, with missing values wherever a row has no match on the other side.
- **Inner join**: Only those rows which are present in both DataFrame A and DataFrame B will be present in the output.
- **Right join**: Right join uses all records from DataFrame B and matching records from DataFrame A.
- **Left join**: Left join uses all records from DataFrame A and matching records from DataFrame B.
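
To make the four join types concrete, here is a small sketch that runs the same merge with each value of `how` and records which keys survive. The frames `a` and `b` and their column names are made up for this illustration.

```python
import pandas as pd

# Hypothetical toy frames: keys K1 and K2 appear in both
a = pd.DataFrame({"key": ["K0", "K1", "K2"], "left_val": [1, 2, 3]})
b = pd.DataFrame({"key": ["K1", "K2", "K3"], "right_val": [10, 20, 30]})

# Run the same merge with each join type and record which keys survive
keys_by_join = {
    how: sorted(pd.merge(a, b, on="key", how=how)["key"])
    for how in ["inner", "left", "right", "outer"]
}
for how, keys in keys_by_join.items():
    print(how, keys)

# indicator=True adds a _merge column showing where each row came from
print(pd.merge(a, b, on="key", how="outer", indicator=True))
```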

<img src="images/joins.png" height=600px width=600px align="left" > 


### Merging
- Merging a Dataframe with one unique key.

#### Syntax:
```
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False)
``` 
- `left` − A DataFrame object.
- `right` − Another DataFrame object.
- `on` − Columns (names) to join on. Must be found in both the left and right DataFrame objects.
- `left_on` − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
- `right_on` − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
- `left_index` − If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
- `right_index` − Same usage as left_index for the right DataFrame.
- `how` − One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each method has been described below.
- `sort` − Sort the result DataFrame by the join keys in lexicographical order. Defaults to False; leaving it False usually gives better performance.
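
As a hedged illustration of the key-related parameters, the sketch below matches a column of the left DataFrame against the index of the right one by combining `left_on` with `right_index`. The `employees` and `departments` frames are invented for this example.

```python
import pandas as pd

# Hypothetical toy frames: employees use a column as key,
# departments use their index as key
employees = pd.DataFrame({"name": ["Mercy", "Prince", "John"],
                          "dept_id": [10, 20, 10]})
departments = pd.DataFrame({"dept_name": ["Sales", "Research"]},
                           index=[10, 20])

# Match the dept_id column on the left against the index on the right
merged = pd.merge(employees, departments,
                  left_on="dept_id", right_index=True, how="left")
print(merged)
```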

In [None]:
# Define a dictionary containing employee data 
import pandas as pd
data1 = {'key':['K0','K1','K2','K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32],} 
# Define a dictionary containing employee data 

data2 = {'key':['K0','K1','K2','K3'],
         'Address':['Canada', 'UK', 'India', 'USA'], 
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']} 

# Convert the dictionary into DataFrame  
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame  
df2 = pd.DataFrame(data2) 

df1, df2

In [None]:
# merging of two dataframes on the basis of `key`
final_df = pd.merge(df1, df2, on='key')
final_df

#### Merging Dataframe using multiple keys.

In [None]:
# Define a dictionary containing employee data 

data1 = {'key':['K0','K1','K2','K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
          'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32],} 
# Define a dictionary containing employee data 

data2 = {'key':['K0','K1','K2','K3'],
         'Address':['Canada', 'UK', 'India', 'USA'], 
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']} 

# Convert the dictionary into DataFrame  
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame  
df2 = pd.DataFrame(data2) 

df1.Address, df2.Address

In [None]:
# merging of two dataframes on the basis of `key` and `Address`
final_df = pd.merge(df1, df2, on=['key','Address'])
final_df

#### Left merge
- In pd.merge() I pass the argument `how='left'` to perform a left merge.

In [None]:
# Define a dictionary containing employee data 

data1 = {'key':['K0','K1','K2','K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
          'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32],} 
# Define a dictionary containing employee data 

data2 = {'key':['K0','K1','K2','K3'],
         'Address':['Canada', 'UK', 'India', 'USA'], 
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']} 

# Convert the dictionary into DataFrame  
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame  
df2 = pd.DataFrame(data2) 

df1

In [None]:
df2

In [None]:
# left merge of two dataframes on the basis of `key` and `Address`
final_df = pd.merge(df1, df2, on=['key','Address'], how='left')
final_df

#### Right merge
- In pd.merge() I pass the argument `how='right'` to perform a right merge.

In [None]:
# Define a dictionary containing employee data 

data1 = {'key':['K0','K1','K2','K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
          'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32],} 
# Define a dictionary containing employee data 

data2 = {'key':['K0','K1','K2','K3'],
         'Address':['Canada', 'UK', 'India', 'USA'], 
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']} 

# Convert the dictionary into DataFrame  
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame  
df2 = pd.DataFrame(data2) 
df1

In [None]:
df2

In [None]:

# right merge of two dataframes on the basis of `key` and `Address`
final_df = pd.merge(df1, df2, on=['key','Address'], how='right')
final_df

#### Outer Merge
- In pd.merge(), I pass the argument `how='outer'` to perform an outer merge.

In [None]:
# Define a dictionary containing employee data 

data1 = {'key':['K0','K1','K2','K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
          'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32],} 
# Define a dictionary containing employee data 

data2 = {'key':['K0','K1','K2','K3'],
         'Address':['Canada', 'UK', 'India', 'USA'], 
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']} 

# Convert the dictionary into DataFrame  
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame  
df2 = pd.DataFrame(data2) 
df1

In [None]:
df2

In [None]:

# outer merge of two dataframes on the basis of `key` and `Address`
final_df = pd.merge(df1, df2, on=['key','Address'], how='outer')
final_df

In [None]:
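# join() needs another DataFrame; the suffixes disambiguate the shared columns
df1.join(df2, lsuffix='_left', rsuffix='_right')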

## Join
- `join()` combines DataFrames on their index values, including DataFrames whose indices differ.
- `I have two different tables in Python but I’m not sure how to join them. What criteria should I consider? What are the different ways I can join these tables?`
- Sound familiar? I have come across this question plenty of times on online discussion forums. Working with one table is fairly straightforward but things become challenging when we have data spread across two or more tables.
- This is where the concept of Joins comes in. I cannot emphasize the number of times I have used these Joins in Pandas! They’ve come in especially handy during data science hackathons when I needed to quickly join multiple tables.
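
Before turning to the problem statement, here is a minimal sketch of `DataFrame.join()` on two frames whose indices only partly overlap. The `left` and `right` frames and their labels are made up for this illustration.

```python
import pandas as pd

# Hypothetical toy frames with partially overlapping row labels
left = pd.DataFrame({"Name": ["Mercy", "Prince", "John"]},
                    index=["K0", "K1", "K2"])
right = pd.DataFrame({"Qualification": ["B.A", "MS", "Phd"]},
                     index=["K1", "K2", "K3"])

# join() aligns on the index and defaults to a left join
left_joined = left.join(right)

# how="outer" keeps every label found in either frame
outer_joined = left.join(right, how="outer")

print(left_joined, outer_joined, sep="\n\n")
```

In the left join, `K0` has no partner in `right`, so its `Qualification` is missing; the outer join additionally keeps `K3`.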

#### Understanding the Problem Statement

- I’m sure you’re quite familiar with e-commerce sites like `Amazon` and `Flipkart` these days. We are bombarded by their advertisements when we’re visiting non-related websites – that’s the power of targeted marketing!
- We’ll take a simple problem from a related marketing brand here. We are given two tables – one which contains data about products and the other that has customer-level information.
- We will use these tables to understand how the different types of joins work using Pandas.

#### Note: 
 >- Our task is to use our joining skills and generate meaningful information from the data.

In [None]:
# The product dataframe contains product details like Product_ID, Product_name, Category, Price, and Seller_City. 
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
    'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
    'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
    'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
})

# The customer dataframe contains details like id, name, age, Product_ID, Purchased_Product, and City.
customer=pd.DataFrame({
    'id':[1,2,3,4,5,6,7,8,9],
    'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
    'age':[20,25,15,10,30,65,35,18,23],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})

- Let’s say we want to know about all the products sold online and who purchased them. We can get this easily using an inner join.

- The `merge()` function in Pandas is our friend here. By default, the merge function performs an inner join. It takes both the dataframes as arguments and the name of the column on which the join has to be performed:

In [None]:
product

In [None]:
customer

In [None]:
pd.merge(product, customer, on='Product_ID')

- Here, I have performed inner join on the product and customer dataframes on the `Product_ID` column.
- But, what if the column names are different in the two dataframes? Then, we have to explicitly mention both the column names.
- `left_on` and `right_on` are two arguments through which we can achieve this. `left_on` is the name of the key in the left dataframe and `right_on` in the right dataframe

In [None]:
pd.merge(product, customer, left_on='Product_name', right_on='Purchased_Product')

- Let’s take things up a notch. The leadership team now wants more details about the products sold. They want to know about all the products sold by a seller to a customer in the same city, i.e., seller and customer both belong to the same city.

- In this case, we have to perform an inner join on both Product_ID and Seller_City of product and Product_ID and City columns of the customer dataframe.

In [None]:
pd.merge(product, customer, left_on=['Product_ID', 'Seller_City'], right_on=['Product_ID','City'])

## Concatenation
Concatenate two or more DataFrames using the `pd.concat()` function.

In [None]:
print("First DataFrame : ", df1, sep="\n")
print("Second DataFrame : ", df2, sep="\n")

In [None]:
frames = [df1, df2]
# concatenation using the concat function
pd.concat(frames)

The resultant DataFrame has a repeated index. If you want the new Dataframe to have its own index, set `ignore_index` to True.

In [None]:
frames = [df1, df2]
# concatenation using the concat function
pd.concat(frames, ignore_index=True)

#### Note: 
 >- The second DataFrame is concatenated below the first one, so the resultant DataFrame has new rows. If you want the second DataFrame to be added as columns, pass the argument axis=1.

In [None]:
frames = [df1, df2]
# concatenation along columns using the concat function
pd.concat(frames, axis=1)

#### Note: 
 >- Here the columns shared by both DataFrames are repeated in the result. To avoid this, concatenate along rows instead, as in the next section.

### Concatenating using `.append()` function
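- The `append()` function concatenates along axis = 0 only. It can take multiple objects as input. It was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use `pd.concat()` instead.

In [None]:
# Equivalent of df1.append(df2, ignore_index=True) on pandas 2.0 and later
pd.concat([df1, df2], ignore_index=True)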

## Practices 
- Import pandas library.
- Download both datasets for this exercise from here [data1](https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/data1.csv) and [data2](https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/data2.csv) 

### Step 1 : Write a program to join the two given dataframes along rows and assign all to variable `data`.

In [None]:
data1 = pd.read_csv('datasets/data1.csv')
data2 = pd.read_csv('datasets/data2.csv')
print("First Data : ", data1, sep="\n")
print("Second Data : ", data2, sep="\n")
print("Joining of two dataframes along rows wise ...")
data = pd.concat([data1, data2],)
data

# pd.concat([df1,df2])

### Step 2 : Write a program to join the two given dataframes along columns and assign all to variable data.

In [None]:
# data1 = pd.read_csv('datasets/data1.csv')
# data2 = pd.read_csv('datasets/data2.csv')
# # print("First Data : ", data1, sep="\n")
# # print("Second Data : ", data2, sep="\n")
# print("Joining of two dataframes along rows wise ...")
data = pd.concat([data1, data2],axis=1)
data

### Step 3 : Write a Pandas program to append rows to an existing DataFrame `data1` and display the combined data.

In [None]:
s1 = pd.Series(['S6','Ehtisham Sadiq', 187], index=['student_id', 'name', 'marks'])
s1

In [None]:
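# DataFrame.append() was removed in pandas 2.0; turn the Series into a one-row DataFrame and concatenate
combined_data = pd.concat([data1, s1.to_frame().T], ignore_index=True)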
print("Combined data : ", combined_data, sep="\n")

#### Summary
- For `pandas.DataFrame`, both `join` and `merge` operate on columns and rename the common columns using the given suffix. In terms of row-wise alignment, `merge` provides more flexible control.
- Unlike `join` and `merge`, `concat` can operate on columns or rows, depending on the given axis, and no renaming is performed. In addition, `concat` allows defining hierarchical structures by passing in `keys` and `names`.
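
The two points in this summary can be checked with a small sketch: `merge` renames an overlapping column through `suffixes`, while `concat` leaves names alone and can label each input through `keys` and `names`. The frames `a` and `b`, the suffixes, and the level names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frames that share the column names "key" and "value"
a = pd.DataFrame({"key": ["K0", "K1"], "value": [1, 2]})
b = pd.DataFrame({"key": ["K0", "K1"], "value": [3, 4]})

# merge renames the overlapping non-key column using the suffixes
merged = pd.merge(a, b, on="key", suffixes=("_a", "_b"))

# concat does no renaming; keys/names label each input in a MultiIndex
stacked = pd.concat([a, b], keys=["first", "second"], names=["source", "row"])

print(merged, stacked, sep="\n\n")
```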

## How To Perform Data Visualization with Pandas


#### Introduction
- Data visualization is one of the most important steps in the life cycle of data science, data analytics, and data engineering. A study or analysis is more engaging and easier to understand when it is presented with colours and graphics. With visual elements such as graphs, charts, and maps, it becomes easier for clients to see the underlying structure, trends, patterns, and relationships among the variables in a dataset. Explaining a data summary or analysis with plain numbers alone is hard to follow for people from both technical and non-technical backgrounds. Data visualization gives us a clear idea of what the data is telling us and makes its insights feel natural to understand.
- Data visualization involves processing a large amount of data and converting it into meaningful, informative visuals using various tools. To visualize data we need software that can handle structured and unstructured data from different sources such as files, web APIs, and databases. We should choose a visualization tool that fulfils all our requirements: interactive plot generation, connectivity to data sources, combining data sources, automatic data refresh, secured access to data sources, and exporting widgets. These features help us make good visuals of our data and also save time.
#### Advantages of Data Visualization
<img src="images/70513benefits.jpg" height=600px width=600px >

### Data Visualization with Pandas:

- The Pandas library in Python is mainly used for data analysis. It is not a data visualization library, but we can create basic plots with it. Pandas is highly useful and practical for exploratory data analysis plots. For such tasks we do not need to call another visualization library directly, although Pandas relies on Matplotlib under the hood, so Matplotlib must be installed.

- As Python’s popular data analysis library, Pandas provides several functions for visualizing data through the `.plot()` function. Another advantage of using Pandas for visualization is that we can chain data analysis functions and plotting functions into a single pipeline, which simplifies the task.
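
A minimal sketch of such a pipeline is shown below: grouping, aggregating, sorting, and plotting are chained in one expression. The `sales` DataFrame is made up for this illustration, and the example assumes Matplotlib is installed.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
sales = pd.DataFrame({
    "category": ["Fashion", "Electronics", "Fashion", "Grocery"],
    "price": [299.0, 14999.0, 1350.5, 110.0],
})

# Analysis and plotting chained in one expression:
# group -> aggregate -> sort -> plot
ax = (
    sales.groupby("category")["price"]
         .mean()
         .sort_values()
         .plot(kind="bar", title="Mean price per category")
)
ax.set_ylabel("price")
```

The call returns a Matplotlib `Axes` object, so further styling can be applied to `ax` afterwards.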

#### Creating of Dataframe

In [None]:
x = pd.Series(np.arange(1,21))
y = x**1.2
z = x**1.7
w = x**.5

In [None]:
dict1 = {'col1':x,
        'col2':y,
        'col3':z,
        'col4':w}
# dict1

In [None]:
#importing packages
import numpy as np
import pandas as pd

#creating a DataFrame
df = pd.DataFrame(dict1)
# The columns are computed from a fixed range, so the values are the same every time you run this code.

#displaying the DataFrame
df

#### Line plot:
- Line plot can be created with DataFrame.plot() function.

In [None]:
df.plot()

We get a reasonable line plot for `df` without passing any arguments to the `.plot()` function. We can also plot one column against another.

In [None]:
df.plot(x='col1', y='col4')

In [None]:
df[['col3','col4']].plot()

In [None]:
# We can also generate subplots for individual columns.
df.plot(subplots=True, figsize=(8,8))

### Bar plot:
- Now, we will create bar plots for the same dataframe. Bar plot can be created with `DataFrame.plot.bar()` function.

In [None]:
df.plot.bar()

In [None]:
df.plot.bar(stacked=True)
# In this bar plot, the bars are stacked.

In [None]:
df.plot.barh(stacked=True)
# In this horizontal bar plot, the bars are stacked.

### Histogram Plot:
Now, let’s generate a histogram for the `df`. Histogram plot can be created with `DataFrame.plot.hist()` function.

In [None]:
df.plot.hist()

In [None]:
# Now, let’s create a histogram with some other features.
df.plot.hist(stacked=True, bins=15)
# This is a stacked histogram.

In [None]:
df.plot.hist(orientation="horizontal", cumulative=True);
# Here, we have added a cumulative frequency in the histogram.

In [None]:
# Let’s create a histogram for each column individually (here, of the row-to-row differences).
df.diff().hist(bins=15)

### Box Plot:
Now, we will create box plot. Box plot can be created with `DataFrame.plot.box()` function or `DataFrame.boxplot()`.

In [None]:
# First Method    
df.plot.box()

In [None]:
# Second Method
df.boxplot()

In [None]:
# Now, generating the box plot in a horizontal form.
df.plot.box(vert=False)

### Area plot:
Now, we will create an area plot. An area plot can be created with the `DataFrame.plot.area()` function. By default, it is stacked.

In [None]:
df.plot.area()

In [None]:
# Now, we will create unstacked area plot.
df.plot.area(stacked=False)

### Scatter plot:
Now, let’s generate a scatter plot. A scatter plot can be created with the `DataFrame.plot.scatter()` function. A scatter plot takes two required arguments, x and y, so we pass the names of the columns to use for the `x-axis` and `y-axis`.

In [None]:
df.plot.scatter('col4', 'col3')

In [None]:
# Let’s apply some styles: col1 vs col2 as red stars, with col3 vs col4 overlaid in blue on the same axes.
ax = df.plot.scatter('col1', 'col2', color='r', marker="*", s=100)
df.plot.scatter(x='col3',y='col4', color='b', s=100, ax=ax)

In this plot, col1 vs col2 and col3 vs col4 are drawn on the same axes, and we have added some styles such as color, marker, and size of the points. Let’s see another style of scatter plot.

In [None]:
df.plot.scatter(x='col2', y='col4', c='col1', s=100)
# The c keyword is given as the name of a column to provide colours for each point.

### Pie chart:
- A Pie plot can be created with `DataFrame.plot.pie()` function or `Series.plot.pie()`. To generate a pie chart we will create series data as a pie chart is created only for one column. Let’s create a series named pie.

In [None]:
pie = pd.Series(np.random.randint(10,100,4))
pie

In [None]:
pie.plot.pie()

In [None]:
# Let's apply some styles
pie.plot.pie(autopct='%.2f')

### Pie Chart for DataFrame
- A Pie chart can be created for DataFrames  also but it will generate individual pies for each column of DataFrame in the form of subplots. Let’s Create a pie chart for the dataframe also

In [None]:
new_df = pd.DataFrame(np.random.randint(20,100,(5,3)),columns=['col1','col2','col3'])
new_df

In [None]:
new_df.plot.pie(subplots=True, figsize=(15,15), autopct='%.2f')

## Practice Exercise Part 1:
#### Visualizing the Titanic Disaster
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/train.csv)
- Step 3. Assign it to a variable titanic
- Step 4. Set PassengerId as the index
- Step 5. Create a pie chart presenting the male/female proportion
- Step 6. Create a scatter plot of the Fare paid against the Age, and vary the plot color by gender
- Step 7. How many people survived and how many died? Display the result using a pie chart.
- Step 8. Create a histogram of the Fare paid.

In [None]:
# url = "https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/train.csv"
# titanic = pd.read_csv(url,)

# # titanic = pd.read_csv(url,index_col='PassengerId')
# # titanic.head()
# # OR
# titanic.set_index('PassengerId').head()
# titanic.shape
# males = (titanic.Sex == 'male').sum()
# females = (titanic.Sex =='female').sum()
# print("Total Males : ", males)
# print("Total Females : ", females)
# new_titanic = pd.Series([males,females])
# new_titanic
# new_titanic.plot.pie(labels=['Male','Female'], autopct='%.2f')
# titanic.columns
# # titanic.head()
# list1 = []
# for i in titanic.Sex:
#     if i=='male':
#         list1.append(1)
#     else:
#         list1.append(0)
# titanic['new_Sex'] = list1
# titanic.head()
# titanic.plot.scatter(x='Fare',y='Age',c='new_Sex',s=10)
# titanic.Fare.min(), titanic.Fare.max(), titanic.Fare.mean()
# titanic.Fare.plot.hist(bins=20)

### Practice Exercise Part 2:
- Step 1. Import the necessary libraries
- Step 2. Import the dataset given below
- Step 3. Assign it to a variable `df3`
- Step 4. Create a scatter plot of `b` vs `a` by using `red` color.
- Step 5. Create a histogram of the `a` column.
- Step 6. Create a histogram of the `b` column and use bins=30.
- Step 7. Create a boxplot comparing the `a` and `b` columns.
- Step 8. Create a kde plot of the `d` column.
- Step 9. Create a kde plot of the `d` column and Figure out how to increase the linewidth and make the linestyle dashed. (Note: You would usually not dash a kde plot line)
- Step 10. Create an area plot of all the columns for just the rows up to 30. (hint: alpha=0.4)


In [None]:
# df3 = pd.DataFrame(np.random.rand(500,4), columns=['a','b','c','d'])
# df3.head()

## Basic Python Pandas Exercise 
- In this exercise, we are using `Automobile Dataset` for data analysis. This Dataset has different characteristics of an auto such as body-style, wheel-base, engine-type, price, mileage, horsepower, etc.
- Download dataset from this link [Automobile data_set](https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/Automobile_data.csv)


In [None]:
url= "https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/Automobile_data.csv"
automobile = pd.read_csv(url)
automobile.head()

In [None]:
automobile.info()

In [None]:
automobile.describe()

### Exercise 1: From the given dataset print the first and last five rows.

In [None]:
automobile.tail()

In [None]:
automobile.sample(20)

### Exercise 2: Clean the dataset and update the CSV file (Hint: `pd.read_csv(na_values=...)`)
Treat all values equal to `?` or `n.a` as missing (`NaN`).

In [None]:
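# na_values takes a list of strings to treat as missing in every column
# (a dict would instead map column names to per-column markers)
automobile = pd.read_csv(url, na_values=['?', 'n.a'])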

In [None]:
automobile
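
The effect of `na_values` can be checked on a small in-memory CSV. The text below is a made-up stand-in for the real file, not an excerpt from it.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real file
raw = io.StringIO(
    "company,price\n"
    "audi,13950\n"
    "bmw,?\n"
    "toyota,n.a\n"
)

# A list of markers applies to every column
cars = pd.read_csv(raw, na_values=["?", "n.a"])
print(cars)
print(cars["price"].isna().sum())
```

Because the markers are parsed as missing values, `price` is read as a numeric column instead of text. To update the CSV file as the exercise asks, the cleaned frame can then be written back with `DataFrame.to_csv()`, for example with `index=False`.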

### Exercise 3: Find the most expensive car company name
Print most expensive car’s company name and price.     
**Expected Output:**
![](images/pandas_printing_most_costly_car_name.png)

In [None]:
a = automobile.groupby('company')[['price']].max().sort_values(by='price').tail(1)
a

In [None]:
# First Method
a = automobile.groupby(['company'])['price'].max()
a.sort_values(ascending=False).reset_index().head(1)

In [None]:
# Second Method
b = automobile[['company','price']][automobile['price'] == automobile['price'].max()]
b

In [None]:
automobile[['company','price']][automobile.price == automobile.price.max()]

### Exercise 4: Print All Toyota Cars details
**Expected Output**
![](images/pandas_printing_all_toyota_car_data.png)    

In [None]:
# automobile[automobile.company == 'toyota']

In [None]:
# # First Method
# automobile[automobile['company'] == 'toyota']

In [None]:
# # Second Method
group = automobile.groupby('company')
group.get_group('toyota')

### Exercise 5: Count total cars per company
**Expected Output**
![](images/pandas_count_total_cars_per_company.png)

In [None]:
automobile.company.value_counts()

### Exercise 6: Find each company’s highest price car
**Expected Outcome:**
![](images/pandas_printing_each_companys_higesht_price_car.png)

In [None]:
# First Method
automobile.groupby(['company'])['price'].max().reset_index()

In [None]:
# Second Method
# 'company' is already the group key, so only 'price' needs to be selected
automobile.groupby('company')[['price']].max()

### Exercise 7: Find the average mileage of each car making company
**Expected Output:**
![](images/pandas_printing_average_mileage_of_each_car_making_company.png)

In [None]:
# First Method
# select columns with a list; 'company' is the group key and cannot be averaged
automobile.groupby('company')[['average-mileage']].mean()

In [None]:
# Second Method
result = automobile.groupby('company')
result[['average-mileage']].mean()

### Exercise 8: Sort all cars by Price column
**Expected Output:**
![](images/pandas_sort_all_cars_by_price_column.png)

In [None]:
automobile.sort_values(by=['price'], ascending=False).head()

### Exercise 9: Concatenate two data frames using the following conditions
Create two data frames using the following two dictionaries.
![](images/pandas_concatenate_two_data_frames_and_create_key_for_each_data_frame.png)

In [None]:
GermanCars = {'Company': ['Ford', 'Mercedes', 'BMV', 'Audi'], 'Price': [23845, 171995, 135925 , 71400]}
japaneseCars = {'Company': ['Toyota', 'Honda', 'Nissan', 'Mitsubishi '], 'Price': [29995, 23600, 61500 , 58900]}
German = pd.DataFrame(GermanCars)
Japan = pd.DataFrame(japaneseCars)
df = pd.concat([German,Japan], keys=['German','Japan'])
df

### Exercise 10: Merge two data frames using the following condition
Create two data frames using the following two Dicts, Merge two data frames, and append the second data frame as a new column to the first data frame.
![](images/merge_two_data_frames_and_append_new_data_frame_as_new-column.png)

In [None]:
Car_Price = {'Company': ['Toyota', 'Honda', 'BMV', 'Audi'], 'Price': [23845, 17995, 135925 , 71400]}
car_Horsepower = {'Company': ['Toyota', 'Honda', 'BMV', 'Audi'], 'horsepower': [141, 80, 182 , 160]}
price = pd.DataFrame.from_dict(Car_Price)
horsepower = pd.DataFrame.from_dict(car_Horsepower)
pd.merge(price,horsepower,on='Company')

## Pandas Data Visualization Exercise
This is just a quick exercise for you to review the various plots we showed earlier. Use **datasets/practice.csv** to replicate the following plots. 

In [None]:
import pandas as pd
import numpy as np

### Q-01: Import your dataset and also display first five rows of your dataset.

In [None]:
df = pd.read_csv('datasets/practice.csv')
df.head()

**Q-02: Create this scatter plot of `b` vs `a`. Note the color and size of the points. Also note the figure size. See if you can figure out how to stretch it in a similar fashion. Remember back to your matplotlib lecture...**

In [None]:
df.plot.scatter(x='b',y='a', color='r', s=100, figsize=(10,5), title="Scatter plot of b vs a")

**Create a histogram of the 'a' column.**

In [None]:
# select the column first, then plot its histogram
df['a'].plot.hist(bins=20)

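**These plots are okay, but they don't look very polished. Use style sheets to set the style with `plt.style.use('ggplot')` and redo the histogram from above. Also figure out how to add more `bins` and `alpha` to it.**

In [None]:
import matplotlib.pyplot as plt
# 'ggplot' is the style the exercise asks for; the old 'seaborn-*' style names are no longer available in recent matplotlib releases
plt.style.use('ggplot')
df['a'].plot.hist(bins=30, alpha=0.5)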

**Create a boxplot comparing the `a` and `b` columns.**

In [None]:
df[['a','b']].plot.box()

**Create a kde plot of the `d` column**

In [None]:
# select the column first; kde plots need SciPy installed
df['d'].plot.kde()

**Figure out how to increase the linewidth and make the linestyle dashed. (Note: You would usually not dash a kde plot line)**

**Create an area plot of all the columns for just the rows up to `30`. (hint: use `.iloc`; the older `.ix` indexer was removed in pandas 1.0).**
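
One possible sketch of this last task is shown below. Since the practice file is not reproduced here, the example builds a random stand-in DataFrame named `demo` with columns `a` to `d`; that data, the seed, and the `figsize` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the practice data: 500 rows, columns a-d
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((500, 4)), columns=["a", "b", "c", "d"])

# .iloc[:30] selects the first 30 rows by position
first_rows = demo.iloc[:30]

# alpha makes the stacked areas semi-transparent
ax = first_rows.plot.area(alpha=0.4, figsize=(10, 4))
```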

# Pandas - Assignment No 01
- Click here to solve [Pandas - Assignment no 01](https://www.kaggle.com/code/ehtishamsadiq/pandas-assignment-no-01)

In [74]:
from IPython.display import HTML

style = """
    <style>
        body {
            background-color: #f2fff2;
        }
        h1 {
            text-align: center;
            font-weight: bold;
            font-size: 36px;
            color: #4295F4;
            text-decoration: underline;
            padding-top: 15px;
        }
        
        h2 {
            text-align: left;
            font-weight: bold;
            font-size: 30px;
            color: #4A000A;
            text-decoration: underline;
            padding-top: 10px;
        }
        
        h3 {
            text-align: left;
            font-weight: bold;
            font-size: 30px;
            color: #f0081e;
            text-decoration: underline;
            padding-top: 5px;
        }

        
        p {
            text-align: center;
            font-size: 12px;
            color: #0B9923;
        }
    </style>
"""

html_content = """
<h1>Hello</h1>
<p>Hello World</p>
<h2> Hello</h2>
<h3> World </h3>
"""

HTML(style + html_content)