# Pandas:
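- Pandas is a Python library that is widely used for data analytics.
- The name "Pandas" is derived from "panel data", and is also a play on "Python data analysis".
- Pandas can read many file formats; these notes use .csv files, where CSV stands for "comma-separated values".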

## Pandas Series:
What is a Series?
- A Pandas Series is like a column in a table.
- It is a one-dimensional array holding data of any type.

## The Series() constructor in the Pandas library

In [105]:
import pandas as pd

ser=[1,2,3]
var=pd.Series(ser)
print(var)

0    1
1    2
2    3
dtype: int64


### Create a Series of any data type with any index

In [106]:
ser=["Mon", "Tue", "Wed"]
var=pd.Series(ser, index=["day1", "day2", "day3"])
print(var["day2"])

Tue


### Create a DataFrame from a dictionary by using DataFrame()

In [107]:
mydataset = {
    "cars": ["BMW", "Volvo", "Ford"],
    "passings": [3,7,2],
}

myvar=pd.DataFrame(mydataset)
print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


## Create Labels
With the index argument, you can name your own labels.

In [108]:
# Create your own labels:
import pandas as pd

a=[1,7,2]
myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

print("\nUsing labels:")
print(myvar["x"])

x    1
y    7
z    2
dtype: int64

Using labels:
1
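
Label-based and position-based access can also be written explicitly with loc and iloc. The following is a minimal sketch that reuses the same data and labels as the cell above; the variable names by_label and by_position are only illustrative.

```python
import pandas as pd

# Same values and labels as in the cell above
myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = myvar.loc["y"]     # look up by label
by_position = myvar.iloc[1]   # look up by integer position (0-based)

print(by_label, by_position)
```

Writing loc or iloc makes it clear to the reader whether a label or a position is meant.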


## Key/Value Objects as Series:
You can also use a key/value object, like a dictionary, when creating a Series.

In [109]:
# Create a simple Pandas Series from a dictionary
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3":390}

var = pd.Series(calories)

print(var)

print("printing with labels:")
print(var["day1"])

day1    420
day2    380
day3    390
dtype: int64
printing with labels:
420
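
When a dictionary is combined with the index argument, only the listed keys end up in the Series. A small sketch, reusing the calories dictionary from above (the name subset is just for illustration):

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the keys named in index are included in the Series
subset = pd.Series(calories, index=["day1", "day2"])

print(subset)
```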


## DataFrames 
- Data sets in Pandas are usually two-dimensional tables, called DataFrames.
- A Series is like a column; a DataFrame is the whole table.

In [110]:
# Create a DataFrame from a dictionary of lists:
import pandas as pd

data = {
    "calories" : [420, 380, 390],
    "duration" : [50, 40, 45]
}

myvar = pd.DataFrame(data)
print(myvar)

   calories  duration
0       420        50
1       380        40
2       390        45
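
Since each column of a DataFrame is a Series, the same table can also be assembled from Series objects. This is only a sketch with the same numbers as above; the variable names are illustrative.

```python
import pandas as pd

# Two separate Series, one per column
calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

# The dictionary keys become the column names
from_series = pd.DataFrame({"calories": calories, "duration": duration})

print(from_series)
```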


## Locate Row
- As you can see from the result above, the DataFrame is like a table with rows and columns.
- Pandas uses the loc attribute to return one or more specified row(s).
- Note: loc with a single label returns a Pandas Series; loc with a list of labels inside [ ] returns a Pandas DataFrame.

In [111]:
print("Return row 0")
print(myvar.loc[0])
print("Return row 0 and 1")
print(myvar.loc[[0, 1]])

Return row 0
calories    420
duration     50
Name: 0, dtype: int64
Return row 0 and 1
   calories  duration
0       420        50
1       380        40
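
loc works with labels, so it behaves the same way when the rows have names instead of numbers. A minimal sketch, assuming hypothetical row labels "day1" to "day3" that are not part of the example above:

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# Hypothetical row labels, added only for this illustration
named = pd.DataFrame(data, index=["day1", "day2", "day3"])

row = named.loc["day2"]                # single label -> Series
rows = named.loc[["day1", "day3"]]     # list of labels -> DataFrame
cell = named.loc["day2", "calories"]   # row label and column label -> one value

print(row)
print(rows)
print(cell)
```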


## Read CSV
- A simple way to store big data sets is to use CSV files (comma-separated values).
- CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.
- Note: if the .csv file is in a different directory/folder, make sure you give the full path.
- Note: use to_string() to print the entire DataFrame.

In [112]:
# Load the CSV into a DataFrame
df = pd.read_csv("book1.csv") # will give FileNotFoundError if file does not exist
print(df.to_string())

    Book number             Book Name        Date
0           1.0           Animal Farm  21-12-2020
1           NaN                  1984  11-04-2023
2           3.0        Fahrenheit 451  21-01-2019
3           4.0                   NaN  22-07-2017
4           5.0      Brave new world.  22-08-2020
5           6.0                   NaN      5/2009
6           7.0       Sherlock Holmes  22-09-2001
7           8.0          Harry Potter  24-08-2003
8           NaN     Lord of the Rings  28-09-1994
9          10.0  Song of Ice and Fire  20-10-2000
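
To try read_csv() without a file on disk, the CSV text can be passed through io.StringIO. The sketch below uses a few made-up lines in the same layout as book1.csv; the text and the name sample are assumptions for illustration, not the real file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Book number,Book Name,Date
1,Animal Farm,21-12-2020
,1984,11-04-2023
3,Fahrenheit 451,21-01-2019
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.to_string())
print(sample.shape)  # (rows, columns)
```

The empty first field in the second data line is read as NaN, which is how the missing values in the output above arise.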
10          1.0           Animal Farm  21-12-2020


## Printing DataFrame

In [113]:
print(df.head())

   Book number         Book Name        Date
0          1.0       Animal Farm  21-12-2020
1          NaN              1984  11-04-2023
2          3.0    Fahrenheit 451  21-01-2019
3          4.0               NaN  22-07-2017
4          5.0  Brave new world.  22-08-2020


In [114]:
print(df.tail())

    Book number             Book Name        Date
6           7.0       Sherlock Holmes  22-09-2001
7           8.0          Harry Potter  24-08-2003
8           NaN     Lord of the Rings  28-09-1994
9          10.0  Song of Ice and Fire  20-10-2000
10          1.0           Animal Farm  21-12-2020
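
head() and tail() return 5 rows by default, and both accept the number of rows as an argument. A small sketch on a hypothetical 10-row DataFrame:

```python
import pandas as pd

# Hypothetical data: one column holding the numbers 0 to 9
numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)   # first 3 rows
last_two = numbers.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```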


## Using info() on DataFrame:

In [115]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Book number  9 non-null      float64
 1   Book Name    9 non-null      object 
 2   Date         11 non-null     object 
dtypes: float64(1), object(2)
memory usage: 392.0+ bytes
None
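
The non-null counts reported by info() can also be turned around into a count of missing values per column with isna().sum(). A minimal sketch on made-up data (the DataFrame sample is an assumption, not book1.csv):

```python
import pandas as pd

# Hypothetical data with one missing value in each column
sample = pd.DataFrame({
    "Book number": [1.0, None, 3.0],
    "Book Name": ["Animal Farm", "1984", None],
})

# isna() marks missing cells as True; sum() counts them per column
missing = sample.isna().sum()

print(missing)
```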


## Printing large DataFrames:
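- If a DataFrame has more rows than the display.max_rows option (60 by default), print() shows only the first 5 and last 5 rows, with "..." in the middle. The DataFrame below has only 11 rows, so it is printed in full.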

In [116]:
print(df) 

    Book number             Book Name        Date
0           1.0           Animal Farm  21-12-2020
1           NaN                  1984  11-04-2023
2           3.0        Fahrenheit 451  21-01-2019
3           4.0                   NaN  22-07-2017
4           5.0      Brave new world.  22-08-2020
5           6.0                   NaN      5/2009
6           7.0       Sherlock Holmes  22-09-2001
7           8.0          Harry Potter  24-08-2003
8           NaN     Lord of the Rings  28-09-1994
9          10.0  Song of Ice and Fire  20-10-2000
10          1.0           Animal Farm  21-12-2020
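
The truncation can be observed by lowering the display.max_rows option for a moment. This is only a sketch on a hypothetical 100-row DataFrame; pd.option_context restores the previous setting when the with block ends.

```python
import pandas as pd

# Hypothetical data: 100 rows, far more than the limit set below
big = pd.DataFrame({"n": range(100)})

# Temporarily limit the display to 10 rows
with pd.option_context("display.max_rows", 10):
    truncated = repr(big)   # the text that print(big) would show

full = big.to_string()      # to_string() always returns every row

print(truncated)
print(len(full.splitlines()))
```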


## Dropping rows with null values:
- Use dropna() to drop every row that contains at least one null value.
- By default, this returns a new DataFrame and leaves the original unchanged.
- To drop the rows in place instead, pass inplace=True as a parameter.

In [117]:
new_df = df.dropna()
print(new_df)

    Book number             Book Name        Date
0           1.0           Animal Farm  21-12-2020
2           3.0        Fahrenheit 451  21-01-2019
4           5.0      Brave new world.  22-08-2020
6           7.0       Sherlock Holmes  22-09-2001
7           8.0          Harry Potter  24-08-2003
9          10.0  Song of Ice and Fire  20-10-2000
10          1.0           Animal Farm  21-12-2020


In [118]:
df.dropna(inplace=True)
print(df)

    Book number             Book Name        Date
0           1.0           Animal Farm  21-12-2020
2           3.0        Fahrenheit 451  21-01-2019
4           5.0      Brave new world.  22-08-2020
6           7.0       Sherlock Holmes  22-09-2001
7           8.0          Harry Potter  24-08-2003
9          10.0  Song of Ice and Fire  20-10-2000
10          1.0           Animal Farm  21-12-2020
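
dropna() can also be limited to certain columns with the subset parameter, so that a row is dropped only when one of those columns is null. The sketch below uses made-up data; reset_index(drop=True) is an optional extra step that renumbers the remaining rows.

```python
import pandas as pd

# Hypothetical data with one missing name and one missing number
sample = pd.DataFrame({
    "Book number": [1.0, 2.0, None, 4.0],
    "Book Name": ["Animal Farm", None, "Fahrenheit 451", "1984"],
})

# Drop a row only when "Book Name" is null
named_only = sample.dropna(subset=["Book Name"])

# Renumber the rows from 0 after dropping
renumbered = named_only.reset_index(drop=True)

print(named_only)
print(renumbered)
```

The row with the missing "Book number" is kept, because only "Book Name" is checked.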


## Filling null values with some other input
- Use fillna() to fill null values with some other value.
- By default, this returns a new DataFrame.
- Pass the keyword argument inplace=True if you want to fill null values in place.
- In case you want to fill null values in one particular column only, use [ ] to select that column and call fillna() on it.

In [119]:
# Load the CSV into a DataFrame
df = pd.read_csv("book1.csv") # will give FileNotFoundError if file does not exist
print(df.to_string())

    Book number             Book Name        Date
0           1.0           Animal Farm  21-12-2020
1           NaN                  1984  11-04-2023
2           3.0        Fahrenheit 451  21-01-2019
3           4.0                   NaN  22-07-2017
4           5.0      Brave new world.  22-08-2020
5           6.0                   NaN      5/2009
6           7.0       Sherlock Holmes  22-09-2001
7           8.0          Harry Potter  24-08-2003
8           NaN     Lord of the Rings  28-09-1994
9          10.0  Song of Ice and Fire  20-10-2000
10          1.0           Animal Farm  21-12-2020


In [120]:
new_df = df.fillna("NA")
print(new_df)

   Book number             Book Name        Date
0          1.0           Animal Farm  21-12-2020
1           NA                  1984  11-04-2023
2          3.0        Fahrenheit 451  21-01-2019
3          4.0                    NA  22-07-2017
4          5.0      Brave new world.  22-08-2020
5          6.0                    NA      5/2009
6          7.0       Sherlock Holmes  22-09-2001
7          8.0          Harry Potter  24-08-2003
8           NA     Lord of the Rings  28-09-1994
9         10.0  Song of Ice and Fire  20-10-2000
10         1.0           Animal Farm  21-12-2020


In [121]:
df.fillna("NA", inplace=True)
print(df)

   Book number             Book Name        Date
0          1.0           Animal Farm  21-12-2020
1           NA                  1984  11-04-2023
2          3.0        Fahrenheit 451  21-01-2019
3          4.0                    NA  22-07-2017
4          5.0      Brave new world.  22-08-2020
5          6.0                    NA      5/2009
6          7.0       Sherlock Holmes  22-09-2001
7          8.0          Harry Potter  24-08-2003
8           NA     Lord of the Rings  28-09-1994
9         10.0  Song of Ice and Fire  20-10-2000
10         1.0           Animal Farm  21-12-2020


## Replace only for specified columns
- The example above replaces all empty cells in the whole DataFrame.
- To replace empty values in only one column, select that column by name before calling fillna().

In [122]:
# Load the CSV into a DataFrame
df = pd.read_csv("book1.csv") # will give FileNotFoundError if file does not exist

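# Assign the result back to the column; fillna(inplace=True) on a selected
# column may not update df itself in newer Pandas versions
df["Book Name"] = df["Book Name"].fillna("BOOK NOT FOUND")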
print(df)

    Book number             Book Name        Date
0           1.0           Animal Farm  21-12-2020
1           NaN                  1984  11-04-2023
2           3.0        Fahrenheit 451  21-01-2019
3           4.0        BOOK NOT FOUND  22-07-2017
4           5.0      Brave new world.  22-08-2020
5           6.0        BOOK NOT FOUND      5/2009
6           7.0       Sherlock Holmes  22-09-2001
7           8.0          Harry Potter  24-08-2003
8           NaN     Lord of the Rings  28-09-1994
9          10.0  Song of Ice and Fire  20-10-2000
10          1.0           Animal Farm  21-12-2020
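
Several columns can be filled with different values in one call by passing a dictionary to fillna(). A minimal sketch on made-up data; the fill values 0 and "UNKNOWN" are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical data with a missing value in each column
sample = pd.DataFrame({
    "Book number": [1.0, None, 3.0],
    "Book Name": ["Animal Farm", None, "1984"],
})

# Keys are column names, values are the fill value for that column
filled = sample.fillna({"Book number": 0, "Book Name": "UNKNOWN"})

print(filled)
```

Filling the numeric column with a number keeps it numeric, whereas filling it with a string such as "NA" would not.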


## Cleaning data of wrong format
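- If you don't specify the format, Pandas has to guess it for each value and may emit warnings.
- In the output below, "11-04-2023" was read month-first (2023-11-04) while the other rows were read day-first, so the guessed result is not consistent.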

In [123]:
df = pd.read_csv("book1.csv")
df["Date"] = pd.to_datetime(df["Date"])
print(df.to_string())

    Book number             Book Name       Date
0           1.0           Animal Farm 2020-12-21
1           NaN                  1984 2023-11-04
2           3.0        Fahrenheit 451 2019-01-21
3           4.0                   NaN 2017-07-22
4           5.0      Brave new world. 2020-08-22
5           6.0                   NaN 2009-05-01
6           7.0       Sherlock Holmes 2001-09-22
7           8.0          Harry Potter 2003-08-24
8           NaN     Lord of the Rings 1994-09-28
9          10.0  Song of Ice and Fire 2000-10-20
10          1.0           Animal Farm 2020-12-21
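
One way to avoid guessing is to state the format explicitly and let values that do not match become NaT (not a time). The sketch below uses three date strings taken from the table above; treating them as day-month-year is an assumption about what the data means.

```python
import pandas as pd

# Three of the date strings from the table above
dates = pd.Series(["21-12-2020", "11-04-2023", "5/2009"])

# Explicit day-month-year format; values that do not match become NaT
parsed = pd.to_datetime(dates, format="%d-%m-%Y", errors="coerce")

print(parsed)
```

With this format "11-04-2023" is read as 11 April 2023, and "5/2009" becomes NaT so that it can be fixed or dropped later.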


## Fixing Wrong Data with loc[ ]

In [124]:
import pandas as pd
df = pd.read_csv("book1.csv")
df.loc[5, "Book Name"] = "Power of Habit"
print(df.to_string())

# we can also loop over the rows and use an if condition like:
# for x in df.index:
#     if df.loc[x, "Book number"] > 8:
#         df.loc[x, "Book number"] = 100

    Book number             Book Name        Date
0           1.0           Animal Farm  21-12-2020
1           NaN                  1984  11-04-2023
2           3.0        Fahrenheit 451  21-01-2019
3           4.0                   NaN  22-07-2017
4           5.0      Brave new world.  22-08-2020
5           6.0        Power of Habit      5/2009
6           7.0       Sherlock Holmes  22-09-2001
7           8.0          Harry Potter  24-08-2003
8           NaN     Lord of the Rings  28-09-1994
9          10.0  Song of Ice and Fire  20-10-2000
10          1.0           Animal Farm  21-12-2020
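
A condition can also be applied to all rows at once by passing a boolean mask to loc, without writing a loop. A minimal sketch on made-up data, using the same threshold (8) and replacement value (100) as the comment above:

```python
import pandas as pd

# Hypothetical data, including one missing value
sample = pd.DataFrame({"Book number": [1.0, None, 10.0, 9.0]})

# True where the value is greater than 8; NaN compares as False
mask = sample["Book number"] > 8

# Overwrite only the rows where the mask is True
sample.loc[mask, "Book number"] = 100

print(sample)
```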


## Checking Duplicates
- Use duplicated() to see whether each row is a duplicate of an earlier row.
- Note that, by default, all values in the entire row must match for a row to count as a duplicate.

In [125]:
df = pd.read_csv("book1.csv")
print(df.to_string())
print("======================================================")
print(df.duplicated())

    Book number             Book Name        Date
0           1.0           Animal Farm  21-12-2020
1           NaN                  1984  11-04-2023
2           3.0        Fahrenheit 451  21-01-2019
3           4.0                   NaN  22-07-2017
4           5.0      Brave new world.  22-08-2020
5           6.0                   NaN      5/2009
6           7.0       Sherlock Holmes  22-09-2001
7           8.0          Harry Potter  24-08-2003
8           NaN     Lord of the Rings  28-09-1994
9          10.0  Song of Ice and Fire  20-10-2000
10          1.0           Animal Farm  21-12-2020
======================================================
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
dtype: bool
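
duplicated() also accepts subset, to compare only some columns, and keep, to choose which copies are marked. The sketch below uses made-up data in which a book name repeats with a different date.

```python
import pandas as pd

# Hypothetical data: the same book name appears twice with different dates
sample = pd.DataFrame({
    "Book Name": ["Animal Farm", "1984", "Animal Farm"],
    "Date": ["21-12-2020", "11-04-2023", "01-01-2021"],
})

whole_row = sample.duplicated()                      # compares every column
by_name = sample.duplicated(subset=["Book Name"])    # compares one column only
all_copies = sample.duplicated(subset=["Book Name"], keep=False)  # marks every copy

print(whole_row.tolist())
print(by_name.tolist())
print(all_copies.tolist())
```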


## Removing Duplicates
- Use drop_duplicates() to remove duplicate rows.
- By default, this returns a new DataFrame.
- Use the parameter inplace=True to drop duplicates in place.

In [126]:
df.drop_duplicates(inplace=True)
print(df.to_string())

   Book number             Book Name        Date
0          1.0           Animal Farm  21-12-2020
1          NaN                  1984  11-04-2023
2          3.0        Fahrenheit 451  21-01-2019
3          4.0                   NaN  22-07-2017
4          5.0      Brave new world.  22-08-2020
5          6.0                   NaN      5/2009
6          7.0       Sherlock Holmes  22-09-2001
7          8.0          Harry Potter  24-08-2003
8          NaN     Lord of the Rings  28-09-1994
9         10.0  Song of Ice and Fire  20-10-2000
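
drop_duplicates() takes the same subset and keep parameters as duplicated(). A minimal sketch on made-up data that keeps the last copy of each book name and then renumbers the rows; both choices are only for illustration.

```python
import pandas as pd

# Hypothetical data: the same book name appears twice with different dates
sample = pd.DataFrame({
    "Book Name": ["Animal Farm", "1984", "Animal Farm"],
    "Date": ["21-12-2020", "11-04-2023", "01-01-2021"],
})

# Compare only "Book Name", keep the last copy, then renumber from 0
deduped = (
    sample.drop_duplicates(subset=["Book Name"], keep="last")
    .reset_index(drop=True)
)

print(deduped)
```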
