# Pandas Practice
___

### 1. Importing Pandas
- The name Pandas refers to both "Panel Data" and "Python Data Analysis".

- Import Pandas into your application with an import statement.

- Pandas is usually imported under the pd alias.

In [1]:
import pandas as pd

data = {
    'Company' : ['Ford', 'Mahindra', 'Maruti'],
    'Car' : ['Mustang', 'Thar', '800']
}

myvar = pd.DataFrame(data)

print(myvar)
print()
print(myvar['Car'])

    Company      Car
0      Ford  Mustang
1  Mahindra     Thar
2    Maruti      800

0    Mustang
1       Thar
2        800
Name: Car, dtype: object


- Checking Pandas Version

In [2]:
print(pd.__version__)

2.2.3


### 2. Pandas Series
- A Pandas Series is like a column in a table.

- It is a one-dimensional array holding data of any type.

In [3]:
a = list(range(1,10,2))

myvar = pd.Series(a)
print(myvar)

0    1
1    3
2    5
3    7
4    9
dtype: int64


#### 2.1 Labels
- If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second has index 1, and so on.

- This label can be used to access a specific value.

In [4]:
a = list(range(1,10,2))

myvar = pd.Series(a)
# print(myvar)
print(myvar[0])

1
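
With the default labels, a value's label and its position are the same number, so **myvar[0]** is unambiguous. A minimal sketch of the explicit accessors, **.loc** for labels and **.iloc** for positions, which do not rely on that coincidence (the variable names are illustrative):

```python
import pandas as pd

values = pd.Series(list(range(1, 10, 2)))  # 1, 3, 5, 7, 9 with labels 0..4

first_by_label = values.loc[0]      # look up the label 0
first_by_position = values.iloc[0]  # look up the first position
last_by_position = values.iloc[-1]  # negative positions count from the end

print(first_by_label, first_by_position, last_by_position)
```

Using **.iloc** for positions keeps the code working unchanged once the Series has custom labels.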


##### 2.1.1 Create Labels
- With the index argument, you can name your own labels.

In [5]:
a = [17, 7, 72]
myvar = pd.Series(a, index=['i', 'me', 'myself'])
print(myvar)

i         17
me         7
myself    72
dtype: int64


- When you have created labels, you can access an item by referring to the label.

In [6]:
a = [17, 7, 72]
myvar = pd.Series(a, index=['i', 'me', 'myself'])
# print(myvar)
print(myvar['me'])

7


#### 2.2 Key/Value Objects as Series
- You can also use a key/value object, like a dictionary, when creating a Series.

- The keys of the dictionary become the labels.

In [7]:
steps = {'day1' : 6750, 'day2' : 5690, 'day3' : 7000}
myvar = pd.Series(steps)
print(myvar)

day1    6750
day2    5690
day3    7000
dtype: int64


- To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.

In [8]:
steps = {'day1' : 6750, 'day2' : 5690, 'day3' : 7000}
myvar = pd.Series(steps, index=['day3', 'day2'])
print(myvar)

day3    7000
day2    5690
dtype: int64
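
One detail worth knowing when combining a dictionary with the index argument: a label that is not a key in the dictionary is kept, but its value is missing (NaN). A small sketch, where **day4** is a hypothetical label chosen only to show this:

```python
import pandas as pd

steps = {'day1': 6750, 'day2': 5690, 'day3': 7000}

# 'day4' is not a key in the dictionary, so its value becomes NaN
partial = pd.Series(steps, index=['day3', 'day4'])

print(partial)
print(partial.isna().tolist())
```

Because NaN is a floating-point value, the Series is stored as float64 rather than int64 in this case.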


### 3. Data Frames (DataFrame)
- Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

- A Series is like a column; a DataFrame is the whole table.

- A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [9]:
data = {
    'Company' : ['Ford', 'Mahindra', 'Maruti'],
    'Car' : ['Mustang', 'Thar', '800']
}

df = pd.DataFrame(data)
print(df)

    Company      Car
0      Ford  Mustang
1  Mahindra     Thar
2    Maruti      800


#### 3.1 Locate Row
- Pandas uses the loc attribute to return one or more specified rows.

In [10]:
data = {
    'Company' : ['Ford', 'Mahindra', 'Maruti'],
    'Car' : ['Mustang', 'Thar', '800']
}

df = pd.DataFrame(data)
print(df.loc[1])  # a single label returns a Pandas Series
print()
# a list of labels returns a DataFrame
print(df.loc[[0, 1]])

Company    Mahindra
Car            Thar
Name: 1, dtype: object

    Company      Car
0      Ford  Mustang
1  Mahindra     Thar
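
Besides single labels and lists, **loc** also accepts slices. A short sketch of the difference from the position-based **iloc**: a label slice includes both endpoints, while a position slice excludes the end, as in ordinary Python slicing.

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['Ford', 'Mahindra', 'Maruti'],
    'Car': ['Mustang', 'Thar', '800'],
})

by_label = df.loc[0:1]      # label slice: rows labeled 0 and 1
by_position = df.iloc[0:1]  # position slice: only the first row

print(len(by_label), len(by_position))
```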


#### 3.2 Named Indexes
- With the index argument, you can name your own indexes.

In [11]:
data = {
    'Company' : ['Ford', 'Mahindra', 'Maruti'],
    'Car' : ['Mustang', 'Thar', '800']
}

df = pd.DataFrame(data, index = ["car1", "car2", "car3"])
print(df)

       Company      Car
car1      Ford  Mustang
car2  Mahindra     Thar
car3    Maruti      800


##### 3.2.1 Locate Named Indexes
- Use the named index in the loc attribute to return the specified row(s).

In [12]:
data = {
    'Company' : ['Ford', 'Mahindra', 'Maruti'],
    'Car' : ['Mustang', 'Thar', '800']
}

df = pd.DataFrame(data, index = ["car1", "car2", "car3"])
print(df.loc['car2'])

Company    Mahindra
Car            Thar
Name: car2, dtype: object
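
The same attribute can select several named rows at once, or a single cell when a column label is added. A minimal sketch reusing the data from above:

```python
import pandas as pd

data = {
    'Company': ['Ford', 'Mahindra', 'Maruti'],
    'Car': ['Mustang', 'Thar', '800'],
}
df = pd.DataFrame(data, index=['car1', 'car2', 'car3'])

subset = df.loc[['car1', 'car3']]  # list of labels -> DataFrame
car_name = df.loc['car2', 'Car']   # row label and column label -> single value

print(subset)
print(car_name)
```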


#### 3.3 Load Files Into a DataFrame
- If data sets are stored in a file, Pandas can load them into a DataFrame.

In [13]:
df = pd.read_csv('cars_advanced.csv')
print(df)

    Car          Brand     Model  Year Engine Size Fuel Type  Mileage (km)  \
0     1         Toyota   Corolla  2018        1.8L    Petrol       42000.0   
1     2          Honda     Civic  2020        2.0L    petrol       25000.0   
2     3           Ford     Focus  2017        1.5L    Diesel       60000.0   
3     4        Hyundai   Elantra  2019         NaN    Petrol       35000.0   
4     5            BMW  3 Series  2016        2.0L    diesel       70000.0   
5     6  Mercedes-Benz   C Class  2018        2.0L    Petrol       40000.0   
6     7          Tesla   Model 3  2021    Electric  electric       15000.0   
7     8            Kia    Optima  2017        2.4L    Petrol       50000.0   
8     9           Audi        A4  2019        2.0L    Diesel       30000.0   
9    10      Chevrolet    Malibu  2016        1.5L    Petrol        -500.0   
10   11          Honda     Civic  2020        2.0L    Petrol       25000.0   
11   12           Ford     Focus  2017        1.5L    Diesel    
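
To try **read_csv()** without a file on disk, the CSV text can be wrapped in an in-memory buffer. This is only a sketch: the three rows below are hypothetical stand-ins, not the contents of cars_advanced.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Brand,Model,Year
Toyota,Corolla,2018
Honda,Civic,2020
Ford,Focus,2017
"""

cars = pd.read_csv(io.StringIO(csv_text))

print(cars.shape)
print(cars.dtypes)
```

The first line of the text is used as the column headers, and Pandas infers a data type for each column.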

### 4. Pandas Read CSV
- A simple way to store big data sets is to use CSV files (comma-separated values files).

- CSV files contain plain text in a well-known format that many tools, including Pandas, can read.

In [14]:
df = pd.read_csv('cars_advanced.csv')
print(df.to_string())

    Car          Brand     Model  Year Engine Size Fuel Type  Mileage (km)  Price (USD) Owner Type
0     1         Toyota   Corolla  2018        1.8L    Petrol       42000.0      14500.0      First
1     2          Honda     Civic  2020        2.0L    petrol       25000.0      18000.0      First
2     3           Ford     Focus  2017        1.5L    Diesel       60000.0      12500.0     Second
3     4        Hyundai   Elantra  2019         NaN    Petrol       35000.0      16000.0      first
4     5            BMW  3 Series  2016        2.0L    diesel       70000.0          NaN      Third
5     6  Mercedes-Benz   C Class  2018        2.0L    Petrol       40000.0      28000.0      First
6     7          Tesla   Model 3  2021    Electric  electric       15000.0      35000.0      First
7     8            Kia    Optima  2017        2.4L    Petrol       50000.0      13000.0     Second
8     9           Audi        A4  2019        2.0L    Diesel       30000.0      27000.0     Second
9    10   

#### 4.1 max_rows
- The number of rows shown when a DataFrame is printed is controlled by the Pandas display options.

- You can check the current maximum number of rows with **pd.options.display.max_rows**.

In [25]:
# pd.options.display.max_rows = 8  # alternative syntax; a truncated display splits the shown rows evenly between the top and the bottom
pd.set_option("display.max_rows", 10) 
print(pd.options.display.max_rows)

df = pd.read_csv('cars_advanced.csv') # df has 15 rows
display(df)

10


Unnamed: 0,Car,Brand,Model,Year,Engine Size,Fuel Type,Mileage (km),Price (USD),Owner Type
0,1,Toyota,Corolla,2018,1.8L,Petrol,42000.0,14500.0,First
1,2,Honda,Civic,2020,2.0L,petrol,25000.0,18000.0,First
2,3,Ford,Focus,2017,1.5L,Diesel,60000.0,12500.0,Second
3,4,Hyundai,Elantra,2019,,Petrol,35000.0,16000.0,first
4,5,BMW,3 Series,2016,2.0L,diesel,70000.0,,Third
...,...,...,...,...,...,...,...,...,...
10,11,Honda,Civic,2020,2.0L,Petrol,25000.0,18000.0,First
11,12,Ford,Focus,2017,1.5L,Diesel,60000.0,12500.0,Second
12,13,,Accord,2018,2.4L,Petrol,45000.0,19000.0,First
13,14,Toyota,Camry,2015,2.5L,Petrol,,14000.0,Second
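
Setting the option as above changes it for the rest of the session. If the limit is only needed for one printout, **pd.option_context()** applies it temporarily. A small sketch on a hypothetical 15-row frame (the column name **n** is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'n': range(15)})  # hypothetical 15-row frame

limit_before = pd.get_option('display.max_rows')

# The setting applies only inside the with block
with pd.option_context('display.max_rows', 10):
    limit_inside = pd.get_option('display.max_rows')
    print(df)  # more rows than the limit, so the middle is elided

limit_after = pd.get_option('display.max_rows')
print(limit_inside, limit_after == limit_before)
```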


### 5. Pandas - Analyzing DataFrames

#### 5.1 Viewing the Data
- One of the most used methods for getting a quick overview of a DataFrame is the **head()** method.

- The **head()** method returns the headers and a specified number of rows, starting from the top.


In [32]:
df = pd.read_csv('cars_advanced.csv')
print(df.head(7))
display(df.head())
# If the number of rows is not specified, head() returns the top 5 rows.

   Car          Brand     Model  Year Engine Size Fuel Type  Mileage (km)  \
0    1         Toyota   Corolla  2018        1.8L    Petrol       42000.0   
1    2          Honda     Civic  2020        2.0L    petrol       25000.0   
2    3           Ford     Focus  2017        1.5L    Diesel       60000.0   
3    4        Hyundai   Elantra  2019         NaN    Petrol       35000.0   
4    5            BMW  3 Series  2016        2.0L    diesel       70000.0   
5    6  Mercedes-Benz   C Class  2018        2.0L    Petrol       40000.0   
6    7          Tesla   Model 3  2021    Electric  electric       15000.0   

   Price (USD) Owner Type  
0      14500.0      First  
1      18000.0      First  
2      12500.0     Second  
3      16000.0      first  
4          NaN      Third  
5      28000.0      First  
6      35000.0      First  


Unnamed: 0,Car,Brand,Model,Year,Engine Size,Fuel Type,Mileage (km),Price (USD),Owner Type
0,1,Toyota,Corolla,2018,1.8L,Petrol,42000.0,14500.0,First
1,2,Honda,Civic,2020,2.0L,petrol,25000.0,18000.0,First
2,3,Ford,Focus,2017,1.5L,Diesel,60000.0,12500.0,Second
3,4,Hyundai,Elantra,2019,,Petrol,35000.0,16000.0,first
4,5,BMW,3 Series,2016,2.0L,diesel,70000.0,,Third


- There is also a **tail()** method for viewing the last rows of the DataFrame.

- The **tail()** method returns the headers and a specified number of rows, starting from the bottom.

In [31]:
print(df.tail())

    Car   Brand     Model  Year Engine Size Fuel Type  Mileage (km)  \
10   11   Honda     Civic  2020        2.0L    Petrol       25000.0   
11   12    Ford     Focus  2017        1.5L    Diesel       60000.0   
12   13     NaN    Accord  2018        2.4L    Petrol       45000.0   
13   14  Toyota     Camry  2015        2.5L    Petrol           NaN   
14   15     BMW  3 series  2016        2.0L    Diesel       70000.0   

    Price (USD) Owner Type  
10      18000.0      First  
11      12500.0     Second  
12      19000.0      First  
13      14000.0     Second  
14      20000.0      third  
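
Both methods take the number of rows as an optional argument and default to 5. A minimal sketch on a hypothetical 15-row frame, so the selected rows are easy to check:

```python
import pandas as pd

df = pd.DataFrame({'n': range(15)})  # hypothetical frame with values 0..14

top = df.head(3)         # first three rows
bottom = df.tail(3)      # last three rows
default_top = df.head()  # no argument: first five rows

print(top['n'].tolist(), bottom['n'].tolist(), len(default_top))
```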


#### 5.2 Info About the Data
- The DataFrame object has a method called **info()** that gives you more information about the data set.

- The **info()** method also tells us how many non-null values are present in each column.

In [33]:
print(df.info())  # info() prints the summary itself and returns None, hence the final "None" line

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Car           15 non-null     int64  
 1   Brand         14 non-null     object 
 2   Model         15 non-null     object 
 3   Year          15 non-null     int64  
 4   Engine Size   14 non-null     object 
 5   Fuel Type     15 non-null     object 
 6   Mileage (km)  14 non-null     float64
 7   Price (USD)   14 non-null     float64
 8   Owner Type    15 non-null     object 
dtypes: float64(2), int64(2), object(5)
memory usage: 1.2+ KB
None
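
When only the counts are needed, they can be computed directly instead of being read off the **info()** report. A sketch on a small hypothetical frame with one missing cell per column:

```python
import pandas as pd

# Hypothetical sample: one missing cell in each column
df = pd.DataFrame({
    'Brand': ['Toyota', None, 'Ford'],
    'Price': [14500.0, 18000.0, float('nan')],
})

missing_per_column = df.isna().sum()  # number of empty cells per column
non_null_per_column = df.count()      # number of non-null cells per column

print(missing_per_column)
print(non_null_per_column)
```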


### 6. Pandas - Cleaning Empty Cells
- Empty cells can potentially give you a wrong result when you analyze data.

#### 6.1 Remove Rows
- One way to deal with empty cells is to remove rows that contain empty cells.

- This is usually acceptable for large data sets, where removing a few rows has little impact on the result.

- By default, the **dropna()** method returns a new DataFrame and does not change the original.

- If you want to change the original DataFrame, use the **inplace = True** argument.

In [36]:
df = pd.read_csv('cars_advanced.csv')
new_df = df.dropna()

print(new_df.to_string())

    Car          Brand     Model  Year Engine Size Fuel Type  Mileage (km)  Price (USD) Owner Type
0     1         Toyota   Corolla  2018        1.8L    Petrol       42000.0      14500.0      First
1     2          Honda     Civic  2020        2.0L    petrol       25000.0      18000.0      First
2     3           Ford     Focus  2017        1.5L    Diesel       60000.0      12500.0     Second
5     6  Mercedes-Benz   C Class  2018        2.0L    Petrol       40000.0      28000.0      First
6     7          Tesla   Model 3  2021    Electric  electric       15000.0      35000.0      First
7     8            Kia    Optima  2017        2.4L    Petrol       50000.0      13000.0     Second
8     9           Audi        A4  2019        2.0L    Diesel       30000.0      27000.0     Second
9    10      Chevrolet    Malibu  2016        1.5L    Petrol        -500.0      11000.0      third
10   11          Honda     Civic  2020        2.0L    Petrol       25000.0      18000.0      First
11   12   
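
The points above can be sketched on a small hypothetical frame. The example also uses the **subset** argument of **dropna()**, which restricts the check to the named columns:

```python
import pandas as pd

# Hypothetical sample: row 1 lacks a brand, row 2 lacks a price
df = pd.DataFrame({
    'Brand': ['Toyota', None, 'Ford'],
    'Price': [14500.0, 18000.0, float('nan')],
})
rows_before = len(df)

complete_rows = df.dropna()              # drop rows with any empty cell
has_brand = df.dropna(subset=['Brand'])  # only require a non-empty Brand

print(len(df) == rows_before)  # the original is still unchanged here

df.dropna(inplace=True)  # now the original itself is modified
print(len(df))
```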

#### 6.2 Replace Empty Values
- Another way of dealing with empty cells is to insert a new value instead.

- This way you do not have to delete entire rows just because of some empty cells.

- The **fillna()** method allows us to replace empty cells with a value

##### 6.2.1 Replacing with a Static Value

In [44]:
df.fillna({'Engine Size' : '1.5l'}, inplace=True)
df.fillna({'Price (USD)' : 16000.0}, inplace=True)
df.fillna({'Brand' : 'Honda'}, inplace=True)
df.fillna({'Mileage (km)' : 70000.0 }, inplace=True)

print(df.to_string())

    Car          Brand     Model  Year Engine Size Fuel Type  Mileage (km)  Price (USD) Owner Type
0     1         Toyota   Corolla  2018        1.8L    Petrol       42000.0      14500.0      First
1     2          Honda     Civic  2020        2.0L    petrol       25000.0      18000.0      First
2     3           Ford     Focus  2017        1.5L    Diesel       60000.0      12500.0     Second
3     4        Hyundai   Elantra  2019        1.5l    Petrol       35000.0      16000.0      first
4     5            BMW  3 Series  2016        2.0L    diesel       70000.0      16000.0      Third
5     6  Mercedes-Benz   C Class  2018        2.0L    Petrol       40000.0      28000.0      First
6     7          Tesla   Model 3  2021    Electric  electric       15000.0      35000.0      First
7     8            Kia    Optima  2017        2.4L    Petrol       50000.0      13000.0     Second
8     9           Audi        A4  2019        2.0L    Diesel       30000.0      27000.0     Second
9    10   
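
Several columns can also be filled in a single call by passing one dictionary with an entry per column. Without **inplace=True**, **fillna()** returns a new DataFrame. A minimal sketch on a hypothetical frame; the replacement values **'Unknown'** and **0.0** are placeholders for illustration, not recommendations:

```python
import pandas as pd

# Hypothetical sample: one missing cell in each column
df = pd.DataFrame({
    'Brand': ['Toyota', None, 'Ford'],
    'Price': [14500.0, 18000.0, float('nan')],
})

# One dictionary, one replacement value per column
filled = df.fillna({'Brand': 'Unknown', 'Price': 0.0})

missing_in_original = int(df.isna().sum().sum())
missing_in_filled = int(filled.isna().sum().sum())
print(missing_in_original, missing_in_filled)
```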

##### 6.2.2 Replace Using Mean, Median, or Mode

- A common way to replace empty cells is to calculate the mean, median, or mode value of the column.

- Pandas uses the **mean()**, **median()**, and **mode()** methods to calculate the respective values for a specified column.

In [51]:
df = pd.read_csv('cars_advanced.csv')

x = df["Brand"].mode()[0]
y = df["Engine Size"].mode()[0]
z = df["Mileage (km)"].mean()
w = df["Price (USD)"].median()

df.fillna({"Brand": x}, inplace=True)
df.fillna({"Engine Size": y}, inplace=True)
df.fillna({"Mileage (km)": z}, inplace=True)
df.fillna({"Price (USD)": w}, inplace=True)


print(df.to_string())

    Car          Brand     Model  Year Engine Size Fuel Type  Mileage (km)  Price (USD) Owner Type
0     1         Toyota   Corolla  2018        1.8L    Petrol  42000.000000      14500.0      First
1     2          Honda     Civic  2020        2.0L    petrol  25000.000000      18000.0      First
2     3           Ford     Focus  2017        1.5L    Diesel  60000.000000      12500.0     Second
3     4        Hyundai   Elantra  2019        2.0L    Petrol  35000.000000      16000.0      first
4     5            BMW  3 Series  2016        2.0L    diesel  70000.000000      17000.0      Third
5     6  Mercedes-Benz   C Class  2018        2.0L    Petrol  40000.000000      28000.0      First
6     7          Tesla   Model 3  2021    Electric  electric  15000.000000      35000.0      First
7     8            Kia    Optima  2017        2.4L    Petrol  50000.000000      13000.0     Second
8     9           Audi        A4  2019        2.0L    Diesel  30000.000000      27000.0     Second
9    10   
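
The **[0]** after **mode()** in the cell above is needed because **mode()** returns a Series, not a single value: a column can have several equally frequent values. A short sketch with hypothetical values, which also shows that **mean()** and **median()** skip empty cells by default:

```python
import pandas as pd

sizes = pd.Series(['2.0L', '1.5L', '2.0L', None])  # hypothetical engine sizes
modes = sizes.mode()    # a Series of the most frequent value(s)
most_common = modes[0]  # take the first one

tied = pd.Series([1, 1, 2, 2]).mode()  # two values tie, both are returned

prices = pd.Series([10.0, 20.0, float('nan'), 60.0])  # hypothetical prices
mean_price = prices.mean()      # NaN is ignored: (10 + 20 + 60) / 3
median_price = prices.median()  # middle of 10, 20, 60

print(most_common, tied.tolist(), mean_price, median_price)
```

When values tie, **mode()[0]** picks the smallest of them, since the result is sorted.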