# Pandas

The pandas library is useful for working with ***structured data***.<br>

What is structured data? <br>
Data stored in tabular form, such as CSV files, Excel spreadsheets, or database tables, is structured.<br>

Unstructured data consists of free-form text, images, sound, or video.<br>

If you are working with structured data, pandas will be a great help.


## Importing Pandas

Most pandas users import the library under an alias so they can refer to it as **pd**.

In [8]:
import pandas as pd

# Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, floats, strings, objects, etc.). It is essentially a single column of a spreadsheet, or a one-dimensional NumPy array with additional functionality.<br>

Key Characteristics:
* One-dimensional: Data is arranged in a single column.
* Labeled: Each element has an associated label (index).
* Size-immutable: values can be changed in place, but operations that add or remove elements generally return a new Series.

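NOTE: When we say that a Series can hold any data type, we mean that the column as a whole has one data type (dtype), not that each value has its own type. If values of unrelated types are mixed (for example numbers and strings), pandas stores the whole column with the generic `object` dtype.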
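To illustrate the note above, here is a small sketch showing that a Series reports one dtype for the whole column. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# A Series of integers has a single integer dtype for the whole column
ints = pd.Series([1, 2, 3])
print(ints.dtype)

# Mixing integers and floats upcasts the whole column to float
mixed_numeric = pd.Series([1, 2.5, 3])
print(mixed_numeric.dtype)

# Mixing numbers and strings falls back to the generic object dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)
```

In each case the dtype belongs to the Series as a whole, which is why adding a single string to a numeric column changes how the entire column is stored.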

![class 5](series_anatomy.png)
Image Source - Pandas Cookbook

## Creating a Series 

In [9]:
# Creating a Series from a list
data = ['a','e','i','o','u']
s = pd.Series(data)
print(f'Series from a list:\n{s}')

# From a NumPy array
import numpy as np
data = np.array([1, 2, 3, 4, 5])
s = pd.Series(data)
print(f'Series from numpy array:\n{s}')

# From a dictionary
data = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(data)
print(f'Series from a dictionary:\n{s}')

Series from a list:
0    a
1    e
2    i
3    o
4    u
dtype: object
Series from numpy array:
0    1
1    2
2    3
3    4
4    5
dtype: int64
Series from a dictionary:
a    1
b    2
c    3
dtype: int64


NOTE: More Series methods and operations will be covered later in the course.

# Data Frame

## Introduction to Data Frame

* A DataFrame is essentially a two-dimensional labeled data structure with columns of potentially different types.<br>
* In simple terms, a DataFrame is a table of data with rows and columns, where each column is a Series.
* Visually, a DataFrame looks like a table consisting of *Rows* and *Columns*.<br>
* Beneath the surface it has three components: *`index`*, *`columns`*, and *`data`*.
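The three components listed above can be inspected directly. The following is a minimal sketch using a made-up two-row DataFrame (the name `people` and its values are assumptions for illustration only).

```python
import pandas as pd

people = pd.DataFrame(
    {'name': ['Alice', 'Bob'], 'age': [25, 30]},
    index=['r1', 'r2'],  # custom row labels, chosen for illustration
)

print(people.index)          # the index: row labels
print(people.columns)        # the columns: column names
print(people.to_numpy())     # the data, as a NumPy array
print(type(people['age']))   # selecting one column gives a Series
```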

![class 6](anatomy_dataframe.png)

*`Index labels`* and *`column names`* refer to the individual members of the index and the columns, respectively.<br>
`Index` refers to all of the index labels as a whole, and `Columns` refers to all of the column names as a whole.

The index labels and column names make it possible to pull out data by label. The index is also used for *alignment*: when multiple Series or DataFrames are combined, their indexes are aligned first, before any calculation occurs.
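As a rough illustration of alignment, the sketch below adds two Series whose indexes only partly overlap. The Series `s1` and `s2` are invented for this example.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Values are matched by index label, not by position.
# Labels present in only one Series ('a' and 'd') produce NaN.
total = s1 + s2
print(total)
```

Only the labels `b` and `c` exist in both Series, so only those two positions hold a numeric sum.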

Collectively, the columns and the index are known as the axes.<br>
**Index - Axis 0**<br>
**Columns - Axis 1**
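The axis numbers show up as the `axis` argument of many DataFrame methods. Here is a small sketch, assuming a made-up `scores` table, of how `sum` behaves along each axis.

```python
import pandas as pd

scores = pd.DataFrame(
    {'math': [90, 80], 'art': [70, 60]},
    index=['Ann', 'Ben'],
)

# axis=0 works down the index (rows): one result per column
col_totals = scores.sum(axis=0)
print(col_totals)

# axis=1 works across the columns: one result per row
row_totals = scores.sum(axis=1)
print(row_totals)
```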

Pandas uses **NaN** (Not a Number) *to represent missing values (including missing string values)*.
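A short sketch of how missing values can be detected and counted with `isna()`. The `people` DataFrame below is invented for illustration.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'name': ['Alice', np.nan, 'Charlie'],  # missing string value
    'age': [25, np.nan, 35],               # missing numeric value
})

print(people)
print(people.isna())        # True wherever a value is missing
missing_per_column = people.isna().sum()
print(missing_per_column)   # number of missing values in each column
```

Note that the `age` column is stored as floats here, because NaN is itself a floating-point value.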

The three consecutive dots, `...`, indicate that at least one column exists but is not shown because of the display limit.

### Creating DataFrames

There are multiple ways to create a DataFrame using the `pd.DataFrame()` constructor.

NOTE: You can also create a DataFrame by reading a file, which will be taught later in the class.

In [10]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(f'DataFrame using Dict:\n{df}')

# Creating a DataFrame from a NumPy array
# (a NumPy array holds a single dtype, so the ages are stored as strings here)
data = np.array([
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
])
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(f'DataFrame using Numpy array:\n{df}')

# Creating a DataFrame from a list of lists
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(f'DataFrame using Lists of List:\n{df}')

DataFrame using Dict:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
DataFrame using Numpy array:
      Name Age         City
0    Alice  25     New York
1      Bob  30  Los Angeles
2  Charlie  35      Chicago
DataFrame using Lists of List:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
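The outputs above look almost identical, but the NumPy-array version stores `Age` differently. The sketch below compares the two constructions. The names `df_np` and `df_list` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

rows = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
]
columns = ['Name', 'Age', 'City']

# A NumPy array has one dtype, so mixing text and numbers turns every value into a string
df_np = pd.DataFrame(np.array(rows), columns=columns)

# A list of lists lets pandas infer a dtype for each column separately
df_list = pd.DataFrame(rows, columns=columns)

print(df_np.dtypes)
print(df_list.dtypes)
print(repr(df_np['Age'].iloc[0]))  # a string, so arithmetic on this column will not work as expected
```

If numeric columns are needed, building the DataFrame from a dictionary or a list of lists avoids this conversion.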


### DataFrame Attributes

DataFrame attributes provide metadata and basic information about the DataFrame.<br>
Some of the DataFrame attributes are:<br>
`df.shape`<br>
`df.columns`<br>
`df.index`<br>
`df.dtypes`<br>
`df.size`

NOTE: 
1. You can pass these attributes to `print()`, or simply evaluate the attribute as the last line of a cell to see its value.
2. You do not have to name your DataFrame *df*; you can give it any valid name. It is like algebra, where the unknown is usually called 'x' but could be given any other letter.

#### `df.shape` - Returns a tuple representing the dimensionality of the DataFrame.<br>

In [11]:
data = {
    'Region': ['Europe', 'North America', 'Asia', 'Africa', 'South America'],
    'No. of Tourists': [30000000, 25000000, 45000000, 15000000, 22000000],
    'Average Temperature (F)': [55, 65, 75, 85, 70]
}
df = pd.DataFrame(data)
print('DataFrame:\n',df)

print(f'')
print(f'Shape of Data Frame: {df.shape}') # shape of the dataframe


DataFrame:
           Region  No. of Tourists  Average Temperature (F)
0         Europe         30000000                       55
1  North America         25000000                       65
2           Asia         45000000                       75
3         Africa         15000000                       85
4  South America         22000000                       70

Shape of Data Frame: (5, 3)


#### `df.columns` - Returns an Index object containing the column labels.<br>

In [12]:
#print(f'Columns of Data Frame: {df.columns}') # gives the column names of the DataFrame
df.columns

Index(['Region', 'No. of Tourists', 'Average Temperature (F)'], dtype='object')

#### `df.index` - Returns an Index object containing the row labels.<br>

In [13]:
#print(f'Index of Data Frame: {df.index}') #gives the index of the dataframe
df.index

RangeIndex(start=0, stop=5, step=1)

In [14]:
# to find the index of a series you can do it the following way

df['Region'].index

RangeIndex(start=0, stop=5, step=1)

#### `df.dtypes` - Returns the data types of each column.<br>

In [15]:
#print(f'Data-types of columns of Data Frame: {df.dtypes}') #returns the datatype of the dataframe
df.dtypes

Region                     object
No. of Tourists             int64
Average Temperature (F)     int64
dtype: object

In [16]:
# to find the datatype of the series
df['Region'].dtype # change 'Region' to 'No. of Tourists' and see the datatype

dtype('O')

#### `df.size` - Returns the number of elements in the DataFrame.<br>

In [17]:
#print(f'Size of Data Frame: {df.size}') # returns the total number of elements (rows x columns)
df.size

15
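The value 15 above is the number of rows times the number of columns (5 x 3). A minimal sketch of how `shape`, `size`, and `len()` relate, using a small made-up DataFrame named `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = df_demo.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(df_demo.size)              # total number of cells: rows * columns
print(len(df_demo))              # len() gives the number of rows only
```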

### DataFrame Methods
`head()`<br>
`tail()`<br>
`info()`<br>
`describe()`<br>
`drop()`<br>
`value_counts()`<br>
`rename()`


#### `head()`: Shows the first n rows of the dataframe.<br>

In [25]:
# Data for some popular cities: city, state, population, temperature, and income.
# This is not real data; it was made up to illustrate the concepts.
data = {
    'City Name': ['New York City', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],
    'State Name': ['New York', 'California', 'Illinois', 'Texas', 'Arizona', 'Pennsylvania', 'Texas', 'California', 'Texas', 'California'],
    'Population (approx.)': [8400000, 4000000, 2700000, 2300000, 1600000, 1600000, 1500000, 1400000, 1400000, 1000000],
    'Average Temperature (F)': [52, 64, 51, 69, 78, 54, 72, 67, 66, 62],
    'Average Income (USD)': [73000, 68000, 62000, 65000, 58000,None, 55000, 70000, 63000, 85000]
}
city_df = pd.DataFrame(data)


# use head() to get the first n rows of the DataFrame
print(f'First 5 rows of the data\n{city_df.head(5)}') # replace 5 with any other number to see the results

First 5 rows of the data
       City Name  State Name  Population (approx.)  Average Temperature (F)  \
0  New York City    New York               8400000                       52   
1    Los Angeles  California               4000000                       64   
2        Chicago    Illinois               2700000                       51   
3        Houston       Texas               2300000                       69   
4        Phoenix     Arizona               1600000                       78   

   Average Income (USD)  
0               73000.0  
1               68000.0  
2               62000.0  
3               65000.0  
4               58000.0  


#### `tail()`: Shows the last n rows of the dataframe.<br>

In [26]:
# use tail to get the last n rows of the dataframe
print(f'Last 6 rows of the data\n{city_df.tail(6)}') # replace 6 with any no. to see the results

Last 6 rows of the data
      City Name    State Name  Population (approx.)  Average Temperature (F)  \
4       Phoenix       Arizona               1600000                       78   
5  Philadelphia  Pennsylvania               1600000                       54   
6   San Antonio         Texas               1500000                       72   
7     San Diego    California               1400000                       67   
8        Dallas         Texas               1400000                       66   
9      San Jose    California               1000000                       62   

   Average Income (USD)  
4               58000.0  
5                   NaN  
6               55000.0  
7               70000.0  
8               63000.0  
9               85000.0  


#### `info()`: Provides a summary of the DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.<br>

In [27]:
city_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   City Name                10 non-null     object 
 1   State Name               10 non-null     object 
 2   Population (approx.)     10 non-null     int64  
 3   Average Temperature (F)  10 non-null     int64  
 4   Average Income (USD)     9 non-null      float64
dtypes: float64(1), int64(2), object(2)
memory usage: 532.0+ bytes


#### `describe()`: Generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.<br>

In [28]:
city_df.describe()

| | Population (approx.) | Average Temperature (F) | Average Income (USD) |
|---|---|---|---|
| count | 10.0 | 10.0 | 9.0 |
| mean | 2590000.0 | 63.5 | 66555.555556 |
| std | 2219835.0 | 8.897565 | 8931.840671 |
| min | 1000000.0 | 51.0 | 55000.0 |
| 25% | 1425000.0 | 56.0 | 62000.0 |
| 50% | 1600000.0 | 65.0 | 65000.0 |
| 75% | 2600000.0 | 68.5 | 70000.0 |
| max | 8400000.0 | 78.0 | 85000.0 |


#### `drop()`: Drops specified labels from rows or columns.<br>


In [29]:
# Let's delete the column 'Average Income (USD)'
city_df.drop(columns=['Average Income (USD)'], inplace=True)
city_df

# If the column does not exist, drop() raises a KeyError.
# To experiment, change inplace to False and display city_df in the next cell to compare the results.
# To drop a row by its index label instead: city_df.drop(5, axis=0)

| | City Name | State Name | Population (approx.) | Average Temperature (F) |
|---|---|---|---|---|
| 0 | New York City | New York | 8400000 | 52 |
| 1 | Los Angeles | California | 4000000 | 64 |
| 2 | Chicago | Illinois | 2700000 | 51 |
| 3 | Houston | Texas | 2300000 | 69 |
| 4 | Phoenix | Arizona | 1600000 | 78 |
| 5 | Philadelphia | Pennsylvania | 1600000 | 54 |
| 6 | San Antonio | Texas | 1500000 | 72 |
| 7 | San Diego | California | 1400000 | 67 |
| 8 | Dallas | Texas | 1400000 | 66 |
| 9 | San Jose | California | 1000000 | 62 |


When `inplace=False`, which is the default, the operation returns a modified copy and leaves the original object unchanged, so you need to assign the result to a variable. When `inplace=True`, the data is modified in place: the call returns nothing and the DataFrame itself is updated.
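The difference between the two modes can be sketched on a tiny DataFrame. The names `demo`, `without_y`, and `columns_before` are assumptions made for this example.

```python
import pandas as pd

demo = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# Default (inplace=False): a new DataFrame is returned and demo is untouched
without_y = demo.drop(columns=['y'])
columns_before = list(demo.columns)
print(columns_before)             # demo still has both columns
print(list(without_y.columns))    # the returned copy has only 'x'

# inplace=True: demo itself is modified, so there is nothing to assign
demo.drop(columns=['y'], inplace=True)
print(list(demo.columns))
```

Assigning the result (`demo = demo.drop(...)`) is a common alternative to `inplace=True` and makes it explicit which object changes.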

#### `value_counts()`: Returns a Series containing counts of unique values.

In [30]:
# Calling .value_counts() on the .dtypes attribute counts
# how many columns have each data type.
dtypes_count = city_df.dtypes.value_counts()
print(dtypes_count)

print()
u_state_count = city_df['State Name'].value_counts()  # number of rows for each unique state
print(u_state_count)

object    2
int64     2
Name: count, dtype: int64

State Name
California      3
Texas           3
New York        1
Illinois        1
Arizona         1
Pennsylvania    1
Name: count, dtype: int64
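If proportions are more useful than raw counts, `value_counts()` also accepts `normalize=True`. A small sketch with a made-up Series of state names:

```python
import pandas as pd

states = pd.Series(['Texas', 'California', 'Texas', 'Arizona'])

# normalize=True returns the share of each unique value instead of its count
shares = states.value_counts(normalize=True)
print(shares)
```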


#### `rename()`: Renames columns or index labels using a dictionary that maps old names to new names.

In [31]:
# let's rename all the columns
city_df.rename(columns={
    'City Name':'city_name',
    'State Name':'state_name',
    'Population (approx.)':'population_approx',
    'Average Temperature (F)':'avg.temp_f'
    }, inplace = True)

city_df

| | city_name | state_name | population_approx | avg.temp_f |
|---|---|---|---|---|
| 0 | New York City | New York | 8400000 | 52 |
| 1 | Los Angeles | California | 4000000 | 64 |
| 2 | Chicago | Illinois | 2700000 | 51 |
| 3 | Houston | Texas | 2300000 | 69 |
| 4 | Phoenix | Arizona | 1600000 | 78 |
| 5 | Philadelphia | Pennsylvania | 1600000 | 54 |
| 6 | San Antonio | Texas | 1500000 | 72 |
| 7 | San Diego | California | 1400000 | 67 |
| 8 | Dallas | Texas | 1400000 | 66 |
| 9 | San Jose | California | 1000000 | 62 |


# Querying Rows and Columns

## Selecting Data

### `loc` Method

`loc` is label-based: you specify rows and columns by their labels. It is used with square brackets, and a label slice such as `0:5` includes both endpoints.

In [36]:
# Let's use the city example from above
city_df.loc[:,"state_name"] # all rows of the column 'state_name'

0        New York
1      California
2        Illinois
3           Texas
4         Arizona
5    Pennsylvania
6           Texas
7      California
8           Texas
9      California
Name: state_name, dtype: object

In [37]:
city_df.loc[0] # select only the row labeled 0 (the first row here)

city_name            New York City
state_name                New York
population_approx          8400000
avg.temp_f                      52
Name: 0, dtype: object

In [40]:
city_df.loc[0:5] # rows labeled 0 through 5 (6 rows, because loc slices include the end label)

| | city_name | state_name | population_approx | avg.temp_f |
|---|---|---|---|---|
| 0 | New York City | New York | 8400000 | 52 |
| 1 | Los Angeles | California | 4000000 | 64 |
| 2 | Chicago | Illinois | 2700000 | 51 |
| 3 | Houston | Texas | 2300000 | 69 |
| 4 | Phoenix | Arizona | 1600000 | 78 |
| 5 | Philadelphia | Pennsylvania | 1600000 | 54 |


In [42]:
city_df.loc[0:2,['city_name','population_approx']] # selects multiple rows and columns

| | city_name | population_approx |
|---|---|---|
| 0 | New York City | 8400000 |
| 1 | Los Angeles | 4000000 |
| 2 | Chicago | 2700000 |


### `iloc` Method

`iloc` is integer-based: you specify rows and columns by their integer position, starting at 0.

In [43]:
city_df.iloc[4] # selecting a single row by integer position

city_name            Phoenix
state_name           Arizona
population_approx    1600000
avg.temp_f                78
Name: 4, dtype: object

In [44]:
city_df.iloc[2:5] # selecting multiple rows (positions 2, 3, 4; the end position is excluded)

| | city_name | state_name | population_approx | avg.temp_f |
|---|---|---|---|---|
| 2 | Chicago | Illinois | 2700000 | 51 |
| 3 | Houston | Texas | 2300000 | 69 |
| 4 | Phoenix | Arizona | 1600000 | 78 |


In [45]:
city_df.iloc[:,3] # selecting all rows of a single column

0    52
1    64
2    51
3    69
4    78
5    54
6    72
7    67
8    66
9    62
Name: avg.temp_f, dtype: int64

In [47]:
city_df.iloc[3:6,0:3] #selecting multiple rows and columns

| | city_name | state_name | population_approx |
|---|---|---|---|
| 3 | Houston | Texas | 2300000 |
| 4 | Phoenix | Arizona | 1600000 |
| 5 | Philadelphia | Pennsylvania | 1600000 |
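With the default 0, 1, 2, ... index, `loc` and `iloc` can look interchangeable. The difference is easier to see when the index holds other labels. The sketch below uses a made-up `temps` DataFrame with string labels.

```python
import pandas as pd

temps = pd.DataFrame(
    {'temp_f': [52, 64, 51]},
    index=['NYC', 'LA', 'CHI'],  # string labels instead of 0, 1, 2
)

by_label = temps.loc['LA', 'temp_f']   # label-based lookup
by_position = temps.iloc[1, 0]         # position-based lookup of the same cell
print(by_label, by_position)

# Slices behave differently: loc includes the end label, iloc excludes the end position
loc_slice = temps.loc['NYC':'LA']      # rows 'NYC' and 'LA'
iloc_slice = temps.iloc[0:1]           # row at position 0 only
print(loc_slice)
print(iloc_slice)
```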


### `Boolean Indexing`

Boolean indexing allows you to filter data based on conditions.
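A minimal sketch of boolean indexing, using a small made-up `cities` DataFrame (a shortened variant of the city data, with the temperature column named `avg_temp_f` for this example only). A comparison on a column produces a Series of True/False values, and passing that Series in square brackets keeps only the rows where it is True.

```python
import pandas as pd

cities = pd.DataFrame({
    'city_name': ['New York City', 'Los Angeles', 'Chicago', 'Houston'],
    'state_name': ['New York', 'California', 'Illinois', 'Texas'],
    'avg_temp_f': [52, 64, 51, 69],
})

# Step 1: build a boolean mask, one True/False per row
warm_mask = cities['avg_temp_f'] > 60
print(warm_mask)

# Step 2: use the mask to keep only the rows where it is True
warm = cities[warm_mask]
print(warm)

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
warm_not_texas = cities[(cities['avg_temp_f'] > 60) & (cities['state_name'] != 'Texas')]
print(warm_not_texas)
```

The filtered result keeps the original index labels, so the remaining rows are still labeled 1 and 3 rather than being renumbered.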

## Filtering Data

### Creating New Columns

### Modifying Existing Columns

### Arithmetic Operations Between Columns

# Operations Between Columns

# IO Operations