# 📘 Day 25
## Pandas

Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Pandas adds data structures and tools designed to work with table-like data: *Series* and *DataFrames*.
Pandas provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation and imputation.
```sh
# Install with pip
pip install pandas

# Or, if you use the Anaconda or Miniconda distribution
conda install pandas
```
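
A quick way to check that the installation worked is to import the library and print its version. This is only a minimal sketch; the version string you see depends on your environment.

```py
import pandas as pd

# If the import succeeds, pandas is installed; the version is a plain string
print(pd.__version__)
```
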
The pandas data structures are built around *Series* and *DataFrames*.
A Series is a single column, and a DataFrame is a two-dimensional table made up of a collection of Series. A pandas Series can be created from a Python list, a dictionary, a scalar value or a one-dimensional NumPy array.

An example of a Series of names:

![pandas series](images/pandas-series-1.png) 

Countries Series

![pandas series](images/pandas-series-2.png) 

Cities Series

![pandas series](images/pandas-series-3.png)

As you can see, a pandas Series is just one column of data. If we want multiple columns, we use a DataFrame.

Let's see an example of a pandas DataFrame:

![Pandas data frame](images/pandas-dataframe-1.png)

A DataFrame is a collection of rows and columns. The table below has more columns than the one above:


![Pandas data frame](images/pandas-dataframe-2.png)
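
The relationship between the two structures can be sketched in a few lines of code. The sample values below are chosen only for illustration, and `pd.concat` is just one of several ways to put Series side by side.

```py
import pandas as pd

# Sample data chosen only for illustration
names = pd.Series(['Asabeneh', 'David', 'John'], name='Name')
countries = pd.Series(['Finland', 'UK', 'Sweden'], name='Country')

# Placing Series side by side (axis=1) gives a DataFrame
df = pd.concat([names, countries], axis=1)

# Selecting a single column gives back a Series
column = df['Name']

print(type(df).__name__)      # DataFrame
print(type(column).__name__)  # Series
```
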

## Importing pandas

In [39]:
import pandas as pd
import numpy as np

### Creating Pandas Series with default index

In [40]:
nums = [1, 2, 3, 4, 5]
s = pd.Series(nums)

In [41]:
s

0    1
1    2
2    3
3    4
4    5
dtype: int64

### Creating Pandas Series with custom index

In [42]:
fruits = ['Orange', 'Banana', 'Mango']
fruits = pd.Series(fruits, index=[1, 2, 3])

In [43]:
fruits

1    Orange
2    Banana
3     Mango
dtype: object
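
With a custom index it helps to distinguish between looking up a value by its label and by its position. The sketch below assumes the same three fruits and uses the `.loc` (label) and `.iloc` (position) accessors.

```py
import pandas as pd

fruits = pd.Series(['Orange', 'Banana', 'Mango'], index=[1, 2, 3])

# .loc looks up by index label, .iloc by zero-based position
print(fruits.loc[1])   # Orange
print(fruits.iloc[0])  # Orange
print(fruits.loc[3])   # Mango
```
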

### Creating Pandas Series from a dictionary

In [44]:
dct = {'name':'Asabeneh','country':'Finland','city':'Helsinki'}

In [45]:
s = pd.Series(dct)

In [46]:
s

name       Asabeneh
country     Finland
city       Helsinki
dtype: object
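
When a Series is built from a dictionary, the keys become the index labels. A short sketch of how those labels can then be used for lookups:

```py
import pandas as pd

dct = {'name': 'Asabeneh', 'country': 'Finland', 'city': 'Helsinki'}
s = pd.Series(dct)

# Dictionary keys become the index labels, in insertion order
print(list(s.index))  # ['name', 'country', 'city']

# Values can be looked up by label, much like in a dictionary
print(s['country'])   # Finland
```
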

### Creating a constant pandas series

In [47]:
s = pd.Series(10, index=[1, 2, 3])

In [48]:
s

1    10
2    10
3    10
dtype: int64
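
The length of a constant Series is decided by the index, because the scalar is repeated once per label. A minimal sketch with a hypothetical four-label index:

```py
import pandas as pd

# The scalar 10 is repeated once for every index label
s = pd.Series(10, index=['a', 'b', 'c', 'd'])

print(len(s))           # 4
print((s == 10).all())  # True
```
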

### Creating a pandas series using linspace

In [52]:
s = pd.Series(np.linspace(5, 20, 10))

In [53]:
s

0     5.000000
1     6.666667
2     8.333333
3    10.000000
4    11.666667
5    13.333333
6    15.000000
7    16.666667
8    18.333333
9    20.000000
dtype: float64
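
The values above come from `np.linspace(start, stop, num)`, which returns `num` evenly spaced values and includes both endpoints by default. The following sketch checks the spacing by hand:

```py
import numpy as np
import pandas as pd

# 10 evenly spaced values from 5 to 20, both endpoints included
s = pd.Series(np.linspace(5, 20, 10))

# The gap between neighbours is (stop - start) / (num - 1)
step = (20 - 5) / (10 - 1)

print(round(step, 6))         # 1.666667
print(s.iloc[0], s.iloc[-1])  # 5.0 20.0
```
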

## DataFrames

Pandas data frames can be created in different ways.

### Creating DataFrames from list of lists

In [63]:
data = [["Asabeneh", "Finland", "Helsinki"], [
    "David", "UK", "London"], ["John", "Sweden", "Stockholm"]]
df = pd.DataFrame(data, columns=['Names','Country','City'])

In [64]:
df

|   | Names | Country | City |
|---|---|---|---|
| 0 | Asabeneh | Finland | Helsinki |
| 1 | David | UK | London |
| 2 | John | Sweden | Stockholm |


### Creating DataFrame using Dictionary

In [65]:
data = {"Name": ["Asabeneh", "David", "John"], "Country":[
    "Finland", "UK", "Sweden"], "City": ["Helsinki", "London", "Stockholm"]}
df = pd.DataFrame(data)

In [66]:
df

|   | Name | Country | City |
|---|---|---|---|
| 0 | Asabeneh | Finland | Helsinki |
| 1 | David | UK | London |
| 2 | John | Sweden | Stockholm |


### Creating DataFrames from list of dictionaries

In [67]:
data = [
    {"Name": "Asabeneh", "Country":"Finland","City":"Helsinki"},
    {"Name": "David", "Country":"UK","City":"London"},
    {"Name": "John", "Country":"Sweden","City":"Stockholm"}]
df = pd.DataFrame(data)

In [68]:
df

|   | Name | Country | City |
|---|---|---|---|
| 0 | Asabeneh | Finland | Helsinki |
| 1 | David | UK | London |
| 2 | John | Sweden | Stockholm |
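
The three approaches are interchangeable when they describe the same data. The sketch below builds one table in all three ways and compares the results. The variable names are hypothetical, and the same column names are used everywhere so the frames can match.

```py
import pandas as pd

rows = [["Asabeneh", "Finland", "Helsinki"],
        ["David", "UK", "London"],
        ["John", "Sweden", "Stockholm"]]
columns = ["Name", "Country", "City"]

# 1. From a list of lists plus explicit column names
df_lists = pd.DataFrame(rows, columns=columns)

# 2. From a dictionary that maps each column name to its values
df_dict = pd.DataFrame({
    "Name": ["Asabeneh", "David", "John"],
    "Country": ["Finland", "UK", "Sweden"],
    "City": ["Helsinki", "London", "Stockholm"],
})

# 3. From a list of dictionaries, one dictionary per row
df_records = pd.DataFrame([dict(zip(columns, row)) for row in rows])

# equals() checks that two frames have the same shape and elements
print(df_lists.equals(df_dict))
print(df_dict.equals(df_records))
```
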


## Reading CSV File using pandas

In [70]:
import pandas as pd

df = pd.read_csv('./data/weight-height.csv')

### Data Exploration
Let's read only the first 5 rows using head()

In [84]:
df.head()  # returns the first five rows; pass a number to head() to get more or fewer rows

|   | Gender | Height | Weight |
|---|---|---|---|
| 0 | Male | 73.847017 | 241.893563 |
| 1 | Male | 68.781904 | 162.310473 |
| 2 | Male | 74.110105 | 212.740856 |
| 3 | Male | 71.730978 | 220.042470 |
| 4 | Male | 69.881796 | 206.349801 |


As you can see, the CSV file has three columns: Gender, Height and Weight. But we don't know the number of rows yet. Let's use the shape attribute.

In [85]:
df.shape  # (rows, columns): 10000 rows and 3 columns

(10000, 3)

Let's get all the column names using the columns attribute.


In [86]:
df.columns

Index(['Gender', 'Height', 'Weight'], dtype='object')
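
Both `shape` and `columns` are attributes, so they are used without parentheses. The sketch below uses a hypothetical three-row frame with the same column names as the CSV file, since the file itself may not be available where you run it.

```py
import pandas as pd

# Hypothetical three-row frame with the same columns as the CSV file
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female"],
    "Height": [73.847017, 68.781904, 66.172652],
    "Weight": [241.893563, 162.310473, 136.777454],
})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = sample.shape
print(n_rows, n_cols)        # 3 3

# columns is an Index object; list() turns it into a plain list
print(list(sample.columns))  # ['Gender', 'Height', 'Weight']
```
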

Let's read only the last 5 rows using tail()

In [87]:
df.tail()  # returns the last five rows; pass a number to tail() to get more or fewer rows

|   | Gender | Height | Weight |
|---|---|---|---|
| 9995 | Female | 66.172652 | 136.777454 |
| 9996 | Female | 67.067155 | 170.867906 |
| 9997 | Female | 63.867992 | 128.475319 |
| 9998 | Female | 69.034243 | 163.852461 |
| 9999 | Female | 61.944246 | 113.649103 |
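
Both `head()` and `tail()` accept the number of rows to return. The sketch below reads a small stand-in for the CSV file from an in-memory string (seven rows copied from the outputs in this section), so it runs without the data directory.

```py
import io
import pandas as pd

# Stand-in for weight-height.csv: seven rows copied from the outputs in this section
csv_text = """Gender,Height,Weight
Male,73.847017,241.893563
Male,68.781904,162.310473
Male,74.110105,212.740856
Male,71.730978,220.042470
Male,69.881796,206.349801
Female,69.034243,163.852461
Female,61.944246,113.649103
"""
sample = pd.read_csv(io.StringIO(csv_text))

first_two = sample.head(2)   # first 2 rows instead of the default 5
last_three = sample.tail(3)  # last 3 rows instead of the default 5

print(first_two)
print(last_three)
print(list(last_three.index))  # [4, 5, 6]
```
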


Now, let's get a specific column using the column name as a key.


In [90]:
heights = df['Height'] # this is now a series

In [92]:
heights

0       73.847017
1       68.781904
2       74.110105
3       71.730978
4       69.881796
          ...    
9995    66.172652
9996    67.067155
9997    63.867992
9998    69.034243
9999    61.944246
Name: Height, Length: 10000, dtype: float64

In [102]:
weights = df['Weight'] # this is now a series

In [103]:
weights

0       241.893563
1       162.310473
2       212.740856
3       220.042470
4       206.349801
           ...    
9995    136.777454
9996    170.867906
9997    128.475319
9998    163.852461
9999    113.649103
Name: Weight, Length: 10000, dtype: float64

In [104]:
len(heights) == len(weights)

True
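
Selecting with a single column name returns a Series, as shown above. As a small extension that the text above does not cover, passing a list of column names returns a DataFrame instead. The frame below is a hypothetical three-row sample with the same columns as the CSV file.

```py
import pandas as pd

# Hypothetical three-row frame with the same columns as the CSV file
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female"],
    "Height": [73.847017, 68.781904, 66.172652],
    "Weight": [241.893563, 162.310473, 136.777454],
})

heights = sample["Height"]            # one column name gives a Series
body = sample[["Height", "Weight"]]   # a list of names gives a DataFrame

print(type(heights).__name__)  # Series
print(type(body).__name__)     # DataFrame
print(body.shape)              # (3, 2)
```
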

In [105]:
heights.describe() # gives statistical information about the height data

count    10000.000000
mean        66.367560
std          3.847528
min         54.263133
25%         63.505620
50%         66.318070
75%         69.174262
max         78.998742
Name: Height, dtype: float64

In [106]:
weights.describe()

count    10000.000000
mean       161.440357
std         32.108439
min         64.700127
25%        135.818051
50%        161.212928
75%        187.169525
max        269.989699
Name: Weight, dtype: float64

In [109]:
df.describe()  # describe() can also give statistical information for a whole DataFrame

|   | Height | Weight |
|---|---|---|
| count | 10000.000000 | 10000.000000 |
| mean | 66.367560 | 161.440357 |
| std | 3.847528 | 32.108439 |
| min | 54.263133 | 64.700127 |
| 25% | 63.505620 | 135.818051 |
| 50% | 66.318070 | 161.212928 |
| 75% | 69.174262 | 187.169525 |
| max | 78.998742 | 269.989699 |
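
Each row of the `describe()` output corresponds to a method that can also be called on its own. The sketch below uses five made-up height values to show the correspondence; `std` here is the sample standard deviation, which is the pandas default.

```py
import pandas as pd

# Five made-up values, chosen so the statistics are easy to check by hand
heights = pd.Series([60.0, 62.0, 64.0, 66.0, 68.0], name="Height")
summary = heights.describe()

print(heights.mean())          # 64.0, same as summary["mean"]
print(heights.median())        # 64.0, same as summary["50%"]
print(heights.quantile(0.25))  # 62.0, same as summary["25%"]
print(heights.std())           # same as summary["std"]
```
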


## Exercises: Day 25
1. Read the hacker_news.csv file from the data directory
1. Get the first five rows
1. Get the last five rows
1. Get the title column as a pandas Series
1. Count the number of rows and columns
1. Filter the titles which contain python
1. Filter the titles which contain JavaScript
1. Explore the data and make sense of it