<a href="https://colab.research.google.com/github/chonginbilly/Moringa_DS/blob/Moringa_python/Understanding_Pandas_Series_and_DataFrames.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="green">*To start working on this notebook, or any other notebook that we will use in this course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

---

# Understanding Pandas Series and DataFrames

## Introduction

Series and DataFrames are the core data structures of pandas and the foundation for efficient data manipulation and analysis. A Series holds one-dimensional labeled data, while a DataFrame holds two-dimensional tabular data. In this lesson, we will work hands-on with both: creating them, manipulating them, and extracting insights from them.

## Objectives

By the end of this lesson, you will be able to:

- Create and manipulate Pandas Series, employing various data structures and performing basic operations.
- Construct, modify, and comprehend Pandas DataFrames, exploring operations on columns and understanding the DataFrame structure.
- Apply functions to DataFrames strategically, using methods such as `apply()`, `map()`, and `applymap()` for versatile data manipulation.
- Handle dates and times efficiently in Pandas, converting formats, extracting information, and performing operations.
- Manage and customize DataFrame indices, mastering techniques like setting/resetting the index and changing it to a specific column.


## Import the libraries

In [None]:
import pandas as pd
import numpy as np

import os

## Pandas Data Types vs. Base Python Data Types

Built-in Python data types such as lists, dictionaries, and sets can be powerful in limited settings, but they often require:

- Several lines of "boilerplate" code to accomplish common tasks, which opens up the possibility of mistakes
- Extra unnecessary memory space for storing data types. For example, if you have a Python list of 100 integers, you are also storing the fact that each one is an integer, and you store that same information again if you increase the length of the list by 1

Using pandas data types such as Series and DataFrames instead of built-in Python data types can address both of these issues. Series and DataFrames have a range of built-in methods which make standard practices and procedures streamlined. Some of these methods can result in dramatic performance gains. To read more about these methods, make sure to continuously reference the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/).
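As a small illustration of the boilerplate point, the sketch below computes an average first with a plain list and then with a Series. The values and variable names are made up for this example.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
prices = [74.69, 15.28, 46.33, 58.22]

# Plain Python: we write the arithmetic ourselves
list_mean = sum(prices) / len(prices)

# pandas: the Series stores one dtype for all values and has built-in methods
price_series = pd.Series(prices, name="unit_price")
series_mean = price_series.mean()

print(price_series.dtype)
print(list_mean, series_mean)
```

Both approaches give the same number; the Series version simply moves the arithmetic into a tested, built-in method.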

With built-in Python types, it is useful to know all of the available methods, since each of them is likely to come up at one point or another, and there aren't that many. **In pandas, by contrast, it is impossible to know every method at any given time, and you should not devote much time to memorization.** We will not deeply explain every pandas method in these upcoming lessons and labs. A critical part of every data scientist's job is to investigate documentation to learn about components of these tools on your own. When you are trying to do something new with your data, there will probably be a pandas method for it, and you'll work over time to get better at finding the appropriate method using the documentation, Google, and StackOverflow.

## Mount Google Drive

In [None]:
from google.colab import drive

# mount your google drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Load and preview the dataset

The dataset from [Kaggle](https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales#) contains historical sales of a supermarket company, recorded in 3 different branches over a period of 3 months. It is a great place for us to get our hands dirty.

### Attribute information

- **Invoice id**: Computer generated sales slip invoice identification number
- **Branch**: Branch of supercenter (3 branches are available identified by A, B and C).
- **City**: Location of supercenters
- **Customer type**: Type of customer, recorded as Member for customers using a member card and Normal for those without one.
- **Gender**: Gender type of customer
- **Product line**: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
- **Unit price**: Price of each product in $
- **Quantity**: Number of products purchased by customer
- **Tax**: 5% tax fee for customer buying
- **Total**: Total price including tax
- **Date**: Date of purchase (Record available from January 2019 to March 2019)
- **Time**: Purchase time (10am to 9pm)
- **Payment**: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)
- **COGS**: Cost of goods sold
- **Gross margin percentage**: Gross margin percentage
- **Gross income**: Gross income
- **Rating**: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10)

In [None]:
# path to data folder
data_path = "/content/drive/MyDrive/Product/Naivas Big Data /Data"

# change directory to data folder
os.chdir(data_path)

files = os.listdir()
print("Files in the directory:", files)

Files in the directory: ['supermarket_sales - Sheet1.csv', 'results.xlsx', 'Iris', 'Wine', 'WorldCupMatches.csv', 'csv', 'results.csv', 'Zipcode_Demos.xlsx', 'Car data', 'Online Retail.xlsx', 'titanic.csv']


In [None]:
# loading the supermarket data
df = pd.read_csv("supermarket_sales - Sheet1.csv")

In [None]:
# preview the data
df.head()

,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29,Cash,76.4,4.761905,3.82,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,1/27/2019,20:33,Ewallet,465.76,4.761905,23.288,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3


In [None]:
# brief description
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Invoice ID               1000 non-null   object 
 1   Branch                   1000 non-null   object 
 2   City                     1000 non-null   object 
 3   Customer type            1000 non-null   object 
 4   Gender                   1000 non-null   object 
 5   Product line             1000 non-null   object 
 6   Unit price               1000 non-null   float64
 7   Quantity                 1000 non-null   int64  
 8   Tax 5%                   1000 non-null   float64
 9   Total                    1000 non-null   float64
 10  Date                     1000 non-null   object 
 11  Time                     1000 non-null   object 
 12  Payment                  1000 non-null   object 
 13  cogs                     1000 non-null   float64
 14  gross margin percentage  

## Working With Pandas Series

A series is a one-dimensional labeled array that can store any data type. Each element in a Series is associated with a label or index, facilitating easy and efficient data retrieval.

Basically, a series is just a single column.

In [None]:
# creating a series from a list
hours = [10, 21, 32, 12, 14, 25, 31]

hours_series = pd.Series(hours, name="working_hours")
hours_series

0    10
1    21
2    32
3    12
4    14
5    25
6    31
Name: working_hours, dtype: int64

We can also set custom labels for a Series by passing an `index`:

In [None]:
# creating series from an array
sales_array = np.array([100, 205, 278, 360, 245, 312, 109])

sales_series = pd.Series(sales_array, index=['a', 'b', 'c', 'e', 'f', 'g', 'h'], name="sales")
sales_series

a    100
b    205
c    278
e    360
f    245
g    312
h    109
Name: sales, dtype: int64

The number of labels must match the number of items in the Series. We can retrieve an individual value by its label using `.loc`:

In [None]:
sales_series.loc["e"]

360
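To make the difference between labels and positions concrete, here is a minimal sketch that rebuilds the same sales Series and looks up one value both ways. It also shows, under the assumption that a mismatched label list is passed, the error pandas raises.

```python
import numpy as np
import pandas as pd

# Same toy data as the sales Series above
sales = pd.Series(
    np.array([100, 205, 278, 360, 245, 312, 109]),
    index=["a", "b", "c", "e", "f", "g", "h"],
    name="sales",
)

by_label = sales.loc["e"]      # look up by index label
by_position = sales.iloc[3]    # look up by integer position (0-based)

# Supplying too few labels raises a ValueError
try:
    pd.Series([1, 2, 3], index=["x", "y"])
    length_error = False
except ValueError:
    length_error = True

print(by_label, by_position, length_error)
```

`.loc` works with the labels we assigned, while `.iloc` ignores them and counts positions from zero.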

We can perform arithmetic operations on a Series, such as addition, subtraction, multiplication, and division. It's important to note that the operation is applied to every element of the Series unless we select the specific elements that we want to operate on.

In [None]:
# multiply the sales by 2
multiplied = sales_series * 2
multiplied

a    200
b    410
c    556
e    720
f    490
g    624
h    218
Name: sales, dtype: int64
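The note about operating on selected elements can be sketched as follows. The three-item Series is a shortened, illustrative version of the sales data, and `adjusted` is a hypothetical name.

```python
import pandas as pd

sales = pd.Series([100, 205, 278], index=["a", "b", "c"], name="sales")

# Element-wise: every value is doubled
doubled = sales * 2

# Only selected labels: copy first so the original stays untouched
adjusted = sales.copy()
adjusted.loc[["a", "c"]] = adjusted.loc[["a", "c"]] * 2

print(doubled)
print(adjusted)
```

Selecting with `.loc` before the arithmetic limits the operation to the chosen labels; the value at `"b"` is left as it was.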

## Using `.map()` to Transform Values

A standard data preparation step you might need to perform is "cleaning up" the values of a dataset so they follow your desired format. The `.map()` method is key for this task. It applies a mapping dictionary or a function element-wise to each value in a Series and returns a new Series with the transformed values.


### Passing in a Dictionary

One of the most straightforward ways to use the `.map()` method on a pandas Series is with a dictionary of values you want to use to replace other values.

Let's say we want to look at the `Branch` column:

In [None]:
df["Branch"].value_counts()

A    340
B    332
C    328
Name: Branch, dtype: int64

We will use `value_counts()` very frequently to understand the distribution of categorical data.

Let's create a Python dictionary that maps the letters `A`, `B` and `C` to the branches they represent.

In [None]:
dict_mapping = {
    "A" : "Naivas Juja City Mall",
    "B" : "Naivas Spur Mall",
    "C" : "Naivas Mountain View"
}

Now we can call the `.map()` method to return a Series with the abbreviations transformed into full names:

In [None]:
df["Branch"].map(dict_mapping)

0      Naivas Juja City Mall
1       Naivas Mountain View
2      Naivas Juja City Mall
3      Naivas Juja City Mall
4      Naivas Juja City Mall
               ...          
995     Naivas Mountain View
996         Naivas Spur Mall
997    Naivas Juja City Mall
998    Naivas Juja City Mall
999    Naivas Juja City Mall
Name: Branch, Length: 1000, dtype: object

Let's go ahead and replace the `Branch` column in df with these new, transformed values:

In [None]:
df['Branch'] = df['Branch'].map(dict_mapping)

df['Branch'].value_counts()

Naivas Juja City Mall    340
Naivas Spur Mall         332
Naivas Mountain View     328
Name: Branch, dtype: int64
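One behavior worth knowing when mapping with a dictionary: values that have no matching key become missing. The sketch below uses a made-up Series and a hypothetical `partial_mapping` to show this.

```python
import pandas as pd

# "D" is a made-up extra code with no entry in the mapping
branches = pd.Series(["A", "B", "C", "D"])

# Hypothetical mapping that covers only part of the data
partial_mapping = {"A": "Branch one", "B": "Branch two", "C": "Branch three"}

mapped = branches.map(partial_mapping)

# Values without a matching key come back as missing (NaN)
n_unmapped = mapped.isna().sum()

print(mapped)
print(n_unmapped)
```

This also means that re-running the replacement cell on an already-mapped column would turn every value into `NaN`, since the full branch names are not keys of the dictionary.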

### Passing in a Function

Another way to use the `.map()` method is by passing in a function. It can be used with a function that takes one input and produces one output.

Let's have a look at the `Rating` column:

In [None]:
df["Rating"]

0      9.1
1      9.6
2      7.4
3      8.4
4      5.3
      ... 
995    6.2
996    4.4
997    7.7
998    4.1
999    6.6
Name: Rating, Length: 1000, dtype: float64

**Functions in Python Review**

Let's review how to do this:

- In Python, we define a function using the `def` keyword. Afterwards, we give the function a **name**, followed by parentheses. Any required (or optional) parameters are specified within the parentheses (`()`), just as you would when you call a function.
- You then specify the function's behavior using a colon (`:`) and an indentation, much the same way you would a for loop or conditional block.
- Finally, if you want your function to return something (as with the `list.pop()` method) as opposed to a function that simply does something in the background but returns nothing (such as `list.append()`), you must use the `return` keyword. Note that as soon as a function reaches a `return` statement, it terminates and no further commands are executed. In other words, `return` both returns a value and ends the function.

Let's create a function called `classify_customer_rating` that categorizes customer ratings. It takes a customer rating as input. If the rating is **8** or higher, the customer is labeled a *Promoter*. If the rating is at least **6** but below **8**, the customer is considered *Passive*. Any rating below **6** is classified as a *Detractor*.

In [None]:
def classify_customer_rating(rating):
  if rating >= 8:
    return "Promoter"
  elif 6 <= rating < 8:
    return "Passive"
  else:
    return "Detractor"

Then call the `.map()` method and pass in the function:

In [None]:
df["Rating"].map(classify_customer_rating)

0       Promoter
1       Promoter
2        Passive
3       Promoter
4      Detractor
         ...    
995      Passive
996    Detractor
997      Passive
998    Detractor
999      Passive
Name: Rating, Length: 1000, dtype: object

Let's create a new column that will contain this new information:

In [None]:
df["customer_category"] = df["Rating"].map(classify_customer_rating)

df['customer_category'].value_counts()

Passive      356
Promoter     329
Detractor    315
Name: customer_category, dtype: int64
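As an alternative to a hand-written function, the same binning can be expressed with `pd.cut`. This is a different technique from the one used above, shown here on a few toy ratings; the bin edges mirror the thresholds of `classify_customer_rating`.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([9.1, 9.6, 7.4, 8.0, 6.0, 5.3], name="Rating")

# right=False makes each bin include its left edge: [-inf, 6), [6, 8), [8, inf)
categories = pd.cut(
    ratings,
    bins=[-np.inf, 6, 8, np.inf],
    labels=["Detractor", "Passive", "Promoter"],
    right=False,
)

print(categories)
```

The result is a categorical Series, which can be handy when the categories have a natural order.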

## Using `apply()` Method in Pandas for Data Transformation

The `apply()` method in pandas is a versatile and powerful function that allows us to apply a custom function along the axis of a DataFrame or Series. It is a fundamental tool for data manipulation and is often used for complex operations, transformations, and aggregations.


### Function Application

Just like `map()`, the primary purpose of `apply()` is to apply a function along the rows or columns of a DataFrame or the elements of a Series.

In [None]:
df["Rating"].apply(classify_customer_rating)

0       Promoter
1       Promoter
2        Passive
3       Promoter
4      Detractor
         ...    
995      Passive
996    Detractor
997      Passive
998    Detractor
999      Passive
Name: Rating, Length: 1000, dtype: object
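On a DataFrame, the `axis` argument decides what the function receives. The following sketch, using a small made-up frame, applies one function per column and another per row.

```python
import pandas as pd

# Small made-up frame with numeric columns only
toy = pd.DataFrame({"Quantity": [5, 8, 10], "Unit price": [10.0, 20.0, 15.0]})

# axis=0 (the default): the function receives one column at a time
column_ranges = toy.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time
row_maxima = toy.apply(lambda row: row.max(), axis=1)

print(column_ranges)
print(row_maxima)
```

The column-wise call returns one value per column, and the row-wise call returns one value per row.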

### Passing Lambda Functions

`apply()` is often used with lambda functions for concise, inline transformations. This allows us to write custom operations without defining a separate function.

In [None]:
# creating a new column "Total Sales"
df['Total Sales'] = df.apply(lambda row: row['Quantity'] * row['Unit price'], axis=1)

In [None]:
df['Total Sales']

0      522.83
1       76.40
2      324.31
3      465.76
4      604.17
        ...  
995     40.35
996    973.80
997     31.84
998     65.82
999    618.38
Name: Total Sales, Length: 1000, dtype: float64
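For simple arithmetic between columns, a row-wise `apply()` is not strictly needed. The sketch below compares it with plain column arithmetic on three rows taken from the preview above; `via_apply` and `via_columns` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({"Quantity": [7, 5, 7], "Unit price": [74.69, 15.28, 46.33]})

# Row-wise apply, as in the cell above
via_apply = toy.apply(lambda row: row["Quantity"] * row["Unit price"], axis=1)

# Column arithmetic: pandas multiplies the two columns element by element
via_columns = toy["Quantity"] * toy["Unit price"]

print(via_columns)
```

Both give the same values. Column arithmetic is usually shorter to write and tends to be faster on large DataFrames because it avoids calling a Python function for every row.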

## `applymap()`

The `applymap()` method applies a function element-wise to every value in a DataFrame, regardless of which row or column the value is in. Because it touches every cell, we need to be extra careful that the function makes sense for every column's data type. Unlike `apply()`, which operates on entire rows or columns, `applymap()` focuses on individual values.

Suppose you want to convert all numerical values in the DataFrame to their corresponding square roots:

In [None]:
# Sample DataFrame from dict
data = {
  'Unit price': [10, 20, 15, 25, 30],
  'Quantity': [5, 8, 10, 3, 6],
  'Tax': [1, 2, 1.5, 2.5, 3],
  'Total': [55, 160, 165, 77.5, 210],
  'Gross income': [5.5, 16, 16.5, 7.75, 21],
}

df_sample = pd.DataFrame(data)

df_sample

,Unit price,Quantity,Tax,Total,Gross income
0,10,5,1.0,55.0,5.5
1,20,8,2.0,160.0,16.0
2,15,10,1.5,165.0,16.5
3,25,3,2.5,77.5,7.75
4,30,6,3.0,210.0,21.0


In [None]:
# Define a function to calculate square root
def square_root(x):
  return x ** 0.5

square_root(64)

8.0

In [None]:
# Apply the square_root function element-wise using applymap
df_transformed = df_sample.applymap(square_root)

# Display the resulting transformed DataFrame
df_transformed

,Unit price,Quantity,Tax,Total,Gross income
0,3.162278,2.236068,1.0,7.416198,2.345208
1,4.472136,2.828427,1.414214,12.649111,4.0
2,3.872983,3.162278,1.224745,12.845233,4.062019
3,5.0,1.732051,1.581139,8.803408,2.783882
4,5.477226,2.44949,1.732051,14.491377,4.582576


The `square_root` function is applied element-wise to each value in the DataFrame using `applymap()`. The result is a new DataFrame with the square root of each numerical value.

`applymap()` is particularly useful when you need to perform a simple operation on every element in a DataFrame without the need for row or column-wise operations.
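A version note: in pandas 2.1 and later, `DataFrame.applymap()` is deprecated in favor of the equivalent `DataFrame.map()`. The sketch below picks whichever is available; the `elementwise` helper name is hypothetical, and the NumPy line is an alternative technique that assumes a purely numeric frame.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({"Unit price": [10, 20], "Quantity": [5, 8]})

def square_root(x):
    return x ** 0.5

# Use DataFrame.map where it exists (pandas 2.1+), otherwise fall back to applymap
elementwise = getattr(df_small, "map", None) or df_small.applymap
roots = elementwise(square_root)

# Assumption: all columns are numeric, so NumPy's vectorized sqrt gives the same values
roots_np = np.sqrt(df_small)

print(roots)
```

For numeric data, the vectorized call is typically the simpler choice; element-wise mapping is more useful when the function cannot be expressed as array arithmetic.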

## Manipulating DataFrame Columns

Let's look at the column names that we have:

In [None]:
df.columns

Index(['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender',
       'Product line', 'Unit price', 'Quantity', 'Tax 5%', 'Total', 'Date',
       'Time', 'Payment', 'cogs', 'gross margin percentage', 'gross income',
       'Rating', 'customer_category', 'Total Sales'],
      dtype='object')

While our DataFrame has no annoying white spaces, it's important to ensure that there are no extra white spaces in our column names. `"City "` is not the same as `"City"`, and this can really make you want to smash your computer or drop it in a bucket full of water, hehe!

We can quickly use a list comprehension to clean up all of the column names.

In [None]:
[col.strip() for col in df.columns]

['Invoice ID',
 'Branch',
 'City',
 'Customer type',
 'Gender',
 'Product line',
 'Unit price',
 'Quantity',
 'Tax 5%',
 'Total',
 'Date',
 'Time',
 'Payment',
 'cogs',
 'gross margin percentage',
 'gross income',
 'Rating',
 'customer_category',
 'Total Sales']

Alternatively we can use:

In [None]:
df.columns.str.strip()

Index(['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender',
       'Product line', 'Unit price', 'Quantity', 'Tax 5%', 'Total', 'Date',
       'Time', 'Payment', 'cogs', 'gross margin percentage', 'gross income',
       'Rating', 'customer_category', 'Total Sales'],
      dtype='object')
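Both expressions above only return the cleaned names; they do not change the DataFrame. To keep the result, assign it back to `df.columns`. Here is a minimal sketch on a made-up frame whose column names carry stray spaces.

```python
import pandas as pd

# Made-up frame whose column names carry stray spaces
messy = pd.DataFrame({" City ": ["Yangon"], "Branch ": ["A"]})

# Assign the stripped names back so the change is kept
messy.columns = messy.columns.str.strip()

print(list(messy.columns))
```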

### Renaming columns

Using a dictionary and the `rename()` method, we can change the column names to our liking.

Let's try renaming the column `Time` to `Purchase time`:

In [None]:
df.rename(columns={"Time":"Purchase time"})

,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Purchase time,Payment,cogs,gross margin percentage,gross income,Rating,customer_category,Total Sales
0,750-67-8428,Naivas Juja City Mall,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1,Promoter,522.83
1,226-31-3081,Naivas Mountain View,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.8200,80.2200,3/8/2019,10:29,Cash,76.40,4.761905,3.8200,9.6,Promoter,76.40
2,631-41-3108,Naivas Juja City Mall,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4,Passive,324.31
3,123-19-1176,Naivas Juja City Mall,Yangon,Member,Male,Health and beauty,58.22,8,23.2880,489.0480,1/27/2019,20:33,Ewallet,465.76,4.761905,23.2880,8.4,Promoter,465.76
4,373-73-7910,Naivas Juja City Mall,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3,Detractor,604.17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,233-67-5758,Naivas Mountain View,Naypyitaw,Normal,Male,Health and beauty,40.35,1,2.0175,42.3675,1/29/2019,13:46,Ewallet,40.35,4.761905,2.0175,6.2,Passive,40.35
996,303-96-2227,Naivas Spur Mall,Mandalay,Normal,Female,Home and lifestyle,97.38,10,48.6900,1022.4900,3/2/2019,17:16,Ewallet,973.80,4.761905,48.6900,4.4,Detractor,973.80
997,727-02-1313,Naivas Juja City Mall,Yangon,Member,Male,Food and beverages,31.84,1,1.5920,33.4320,2/9/2019,13:22,Cash,31.84,4.761905,1.5920,7.7,Passive,31.84
998,347-56-2442,Naivas Juja City Mall,Yangon,Normal,Male,Home and lifestyle,65.82,1,3.2910,69.1110,2/22/2019,15:33,Cash,65.82,4.761905,3.2910,4.1,Detractor,65.82


Note that the DataFrame itself was not changed by doing this. If we look at it now, `Time` is still there:



In [None]:
df.columns

Index(['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender',
       'Product line', 'Unit price', 'Quantity', 'Tax 5%', 'Total', 'Date',
       'Time', 'Payment', 'cogs', 'gross margin percentage', 'gross income',
       'Rating', 'customer_category', 'Total Sales'],
      dtype='object')

If we want the change to "stick", one way to do that is to use `inplace=True`:


In [None]:
df.rename(columns={"Time":"Purchase time"}, inplace=True)

In [None]:
df.columns

Index(['Invoice ID', 'Branch', 'City', 'Customer type', 'Gender',
       'Product line', 'Unit price', 'Quantity', 'Tax 5%', 'Total', 'Date',
       'Purchase time', 'Payment', 'cogs', 'gross margin percentage',
       'gross income', 'Rating', 'customer_category', 'Total Sales'],
      dtype='object')
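Another way to make the change stick is to reassign the result of `rename()`, since it returns a new DataFrame. The sketch below uses a two-column toy frame; `renamed` is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({"Time": ["13:08", "10:29"], "Total": [548.97, 80.22]})

# rename returns a new DataFrame; assigning it to a name keeps the result
renamed = toy.rename(columns={"Time": "Purchase time"})

print(list(toy.columns))      # original is unchanged
print(list(renamed.columns))
```

Reassigning (for example `df = df.rename(...)`) has the same effect as `inplace=True` and makes it explicit which variable holds the updated data.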

### Dropping columns

Sometimes we will have to drop one or more columns from the DataFrame if we determine that these columns don't matter.

In [None]:
df.drop("City", axis=1)

,Invoice ID,Branch,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Purchase time,Payment,cogs,gross margin percentage,gross income,Rating,customer_category,Total Sales
0,750-67-8428,Naivas Juja City Mall,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1,Promoter,522.83
1,226-31-3081,Naivas Mountain View,Normal,Female,Electronic accessories,15.28,5,3.8200,80.2200,3/8/2019,10:29,Cash,76.40,4.761905,3.8200,9.6,Promoter,76.40
2,631-41-3108,Naivas Juja City Mall,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4,Passive,324.31
3,123-19-1176,Naivas Juja City Mall,Member,Male,Health and beauty,58.22,8,23.2880,489.0480,1/27/2019,20:33,Ewallet,465.76,4.761905,23.2880,8.4,Promoter,465.76
4,373-73-7910,Naivas Juja City Mall,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3,Detractor,604.17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,233-67-5758,Naivas Mountain View,Normal,Male,Health and beauty,40.35,1,2.0175,42.3675,1/29/2019,13:46,Ewallet,40.35,4.761905,2.0175,6.2,Passive,40.35
996,303-96-2227,Naivas Spur Mall,Normal,Female,Home and lifestyle,97.38,10,48.6900,1022.4900,3/2/2019,17:16,Ewallet,973.80,4.761905,48.6900,4.4,Detractor,973.80
997,727-02-1313,Naivas Juja City Mall,Member,Male,Food and beverages,31.84,1,1.5920,33.4320,2/9/2019,13:22,Cash,31.84,4.761905,1.5920,7.7,Passive,31.84
998,347-56-2442,Naivas Juja City Mall,Normal,Male,Home and lifestyle,65.82,1,3.2910,69.1110,2/22/2019,15:33,Cash,65.82,4.761905,3.2910,4.1,Detractor,65.82


Note the `axis=1` argument. By default, `df.drop()` tries to drop rows (`axis=0`) with the specified index, e.g.

In [None]:
df.drop(3).head()

,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Purchase time,Payment,cogs,gross margin percentage,gross income,Rating,customer_category,Total Sales
0,750-67-8428,Naivas Juja City Mall,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1,Promoter,522.83
1,226-31-3081,Naivas Mountain View,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29,Cash,76.4,4.761905,3.82,9.6,Promoter,76.4
2,631-41-3108,Naivas Juja City Mall,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4,Passive,324.31
4,373-73-7910,Naivas Juja City Mall,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3,Detractor,604.17
5,699-14-3026,Naivas Mountain View,Naypyitaw,Normal,Male,Electronic accessories,85.39,7,29.8865,627.6165,3/25/2019,18:30,Ewallet,597.73,4.761905,29.8865,4.1,Detractor,597.73


Let's go ahead and permanently drop that column:

In [None]:
df.drop("City", axis=1, inplace=True)

df.head()

,Invoice ID,Branch,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Purchase time,Payment,cogs,gross margin percentage,gross income,Rating,customer_category,Total Sales
0,750-67-8428,Naivas Juja City Mall,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1,Promoter,522.83
1,226-31-3081,Naivas Mountain View,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29,Cash,76.4,4.761905,3.82,9.6,Promoter,76.4
2,631-41-3108,Naivas Juja City Mall,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4,Passive,324.31
3,123-19-1176,Naivas Juja City Mall,Member,Male,Health and beauty,58.22,8,23.288,489.048,1/27/2019,20:33,Ewallet,465.76,4.761905,23.288,8.4,Promoter,465.76
4,373-73-7910,Naivas Juja City Mall,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3,Detractor,604.17
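`drop()` also accepts a list of labels and a `columns=` keyword, which avoids having to remember the axis number. A minimal sketch on a toy frame, with illustrative variable names:

```python
import pandas as pd

toy = pd.DataFrame({"City": ["Yangon"], "Gender": ["Female"], "Total": [548.97]})

# columns= is equivalent to passing the labels with axis=1
trimmed = toy.drop(columns=["City", "Gender"])

# errors="ignore" skips labels that are not present instead of raising KeyError
trimmed_again = trimmed.drop(columns=["City"], errors="ignore")

print(list(trimmed.columns))
```

The `errors="ignore"` option can be useful in notebooks: running an in-place drop cell a second time would otherwise raise a `KeyError`, because the column is already gone.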


### Changing the data type of a column

It's very important that we work with the correct data type for each column.

In [None]:
df.dtypes

Invoice ID                  object
Branch                      object
Customer type               object
Gender                      object
Product line                object
Unit price                 float64
Quantity                     int64
Tax 5%                     float64
Total                      float64
Date                        object
Purchase time               object
Payment                     object
cogs                       float64
gross margin percentage    float64
gross income               float64
Rating                     float64
customer_category           object
Total Sales                float64
dtype: object

All the columns in our data are of the correct data type apart from `Date`, which we are going to look at later on. How about we modify the `Invoice ID` column?

First we have to remove the `-`:

In [None]:
# removing the -

df["New Invoice"] = df['Invoice ID'].str.replace('-', '')
df["New Invoice"]

0      750678428
1      226313081
2      631413108
3      123191176
4      373737910
         ...    
995    233675758
996    303962227
997    727021313
998    347562442
999    849093807
Name: New Invoice, Length: 1000, dtype: object

Now we can convert to dtype `int`.

In [None]:
df["New Invoice"] = df["New Invoice"].astype(int)

In [None]:
# confirm the data types
df.dtypes

Invoice ID                  object
Branch                      object
Customer type               object
Gender                      object
Product line                object
Unit price                 float64
Quantity                     int64
Tax 5%                     float64
Total                      float64
Date                        object
Purchase time               object
Payment                     object
cogs                       float64
gross margin percentage    float64
gross income               float64
Rating                     float64
customer_category           object
Total Sales                float64
New Invoice                  int64
dtype: object

We have successfully converted the column `New Invoice` to an `int`. However, the column contains computer-generated invoice identification numbers, which have no numerical meaning, so we should drop it:

In [None]:
df.drop("New Invoice", axis=1, inplace=True)
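The clean-then-cast steps above can be sketched on toy data. This version makes two assumptions that go beyond the cells above: it passes `regex=False` so the hyphen is treated as a literal character, and it uses `pd.to_numeric` with `errors="coerce"` so that a malformed value becomes `NaN` rather than stopping the conversion.

```python
import pandas as pd

# Toy invoice strings; the last one is deliberately malformed
invoices = pd.Series(["750-67-8428", "226-31-3081", "not-an-id"])

# regex=False treats "-" as a literal character
digits = invoices.str.replace("-", "", regex=False)

# astype(int) would raise on "notanid"; errors="coerce" yields NaN instead
as_numbers = pd.to_numeric(digits, errors="coerce")

print(as_numbers)
```

Because of the missing value, the result is stored as floats; with clean data, `astype(int)` as used above works fine.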

## Handling Date and Time

Date and time are a slightly more complicated datatype. Handling datetime data in pandas is crucial for various data analysis and manipulation tasks. Pandas provides robust functionality to work with dates and times efficiently.

Let's have a look at the `Date` column:

In [None]:
df["Date"]

0       1/5/2019
1       3/8/2019
2       3/3/2019
3      1/27/2019
4       2/8/2019
         ...    
995    1/29/2019
996     3/2/2019
997     2/9/2019
998    2/22/2019
999    2/18/2019
Name: Date, Length: 1000, dtype: object

We can use `pd.to_datetime()`, a very handy function for converting a column of strings (dtype `object`) to datetime values.

In [None]:
def string_to_date(data, date, date_format):

  data[date] = pd.to_datetime(data[date], format=date_format, errors = "coerce")

  return data

In this function, we convert a column containing date strings in a DataFrame to a datetime format using `pd.to_datetime()`. We provide the function with the DataFrame (`data`), the name of the column containing date strings (`date`), and the format in which the dates are written (`date_format`). Find [here](https://www.w3schools.com/python/python_datetime.asp) a summarized table of the different date format codes.

The `errors="coerce"` parameter handles conversion errors gracefully by replacing entries that cannot be parsed with `NaT` (Not a Time) values.
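To see what `errors="coerce"` does in practice, here is a minimal sketch on three toy date strings, one of which is deliberately invalid.

```python
import pandas as pd

# Toy date strings in month/day/year order; the last entry is not a valid date
raw_dates = pd.Series(["1/5/2019", "3/8/2019", "13/40/2019"])

parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

print(parsed)
print(parsed.isna().sum())   # number of entries that could not be parsed
```

Counting the `NaT` values after conversion is a quick way to check whether the chosen format actually matches the data.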

In [None]:
df = string_to_date(df, "Date", "%m/%d/%Y")

df["Date"]

0     2019-01-05
1     2019-03-08
2     2019-03-03
3     2019-01-27
4     2019-02-08
         ...    
995   2019-01-29
996   2019-03-02
997   2019-02-09
998   2019-02-22
999   2019-02-18
Name: Date, Length: 1000, dtype: datetime64[ns]

Now that we have converted the `Date` field to a datetime object we can use some handy built-in methods.

For example, finding the name of the day of the week:

In [None]:
# Make a sample of rows so we can see various dates

date_sample = df['Date'].sample(n=10, random_state=0)

date_sample

993   2019-02-22
859   2019-01-23
298   2019-01-25
553   2019-03-07
672   2019-03-02
971   2019-02-10
27    2019-03-10
231   2019-03-12
306   2019-03-30
706   2019-01-31
Name: Date, dtype: datetime64[ns]

In [None]:
# .dt stores all the pandas datetime methods (only works for datetime columns)

date_sample.dt.day_name()

993       Friday
859    Wednesday
298       Friday
553     Thursday
672     Saturday
971       Sunday
27        Sunday
231      Tuesday
306     Saturday
706     Thursday
Name: Date, dtype: object
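The `.dt` accessor exposes other pieces of a date as well. The sketch below extracts a few of them from three dates that appear in the sample above.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2019-02-22", "2019-01-23", "2019-03-07"]))

months = dates.dt.month              # month as a number (1-12)
month_names = dates.dt.month_name()  # month as text
weekdays = dates.dt.dayofweek        # Monday=0 ... Sunday=6

print(months.tolist())
print(month_names.tolist())
print(weekdays.tolist())
```

Attributes such as `.dt.year` and `.dt.day` follow the same pattern.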

### Datetime Index



We can set the datetime column as the index of the DataFrame, which allows for convenient time-based indexing and slicing.

In [None]:
df.set_index('Date', inplace=True)

In [None]:
df.head()

Date,Invoice ID,Branch,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Purchase time,Payment,cogs,gross margin percentage,gross income,Rating,customer_category,Total Sales
2019-01-05,750-67-8428,Naivas Juja City Mall,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,13:08,Ewallet,522.83,4.761905,26.1415,9.1,Promoter,522.83
2019-03-08,226-31-3081,Naivas Mountain View,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,10:29,Cash,76.4,4.761905,3.82,9.6,Promoter,76.4
2019-03-03,631-41-3108,Naivas Juja City Mall,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,13:23,Credit card,324.31,4.761905,16.2155,7.4,Passive,324.31
2019-01-27,123-19-1176,Naivas Juja City Mall,Member,Male,Health and beauty,58.22,8,23.288,489.048,20:33,Ewallet,465.76,4.761905,23.288,8.4,Promoter,465.76
2019-02-08,373-73-7910,Naivas Juja City Mall,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,10:37,Ewallet,604.17,4.761905,30.2085,5.3,Detractor,604.17
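With a datetime index in place, rows can be selected by date strings. The following is a minimal sketch on a four-row toy frame (the totals are rounded, illustrative values); it sorts the index first, which is an extra step not shown above.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Total": [548.97, 80.22, 634.38, 60.82]},
    index=pd.to_datetime(["2019-01-05", "2019-03-08", "2019-02-08", "2019-02-06"]),
)

# Sorting first keeps label-based slicing well defined
toy = toy.sort_index()

# Partial-string indexing: every row in February 2019
february = toy.loc["2019-02"]

# Slicing between two dates; both endpoints are included
first_week = toy.loc["2019-02-01":"2019-02-07"]

print(february)
print(first_week)
```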


In [None]:
# resetting the index
df.reset_index(inplace=True)

### Filtering by Date


Let's filter the DataFrame based on specific date ranges using boolean indexing.

In [None]:
filtered_data = df[(df['Date'] >= '2019-02-01') & (df['Date'] < '2019-02-10')]

In [None]:
# preview filtered data
filtered_data

,Date,Invoice ID,Branch,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Purchase time,Payment,cogs,gross margin percentage,gross income,Rating,customer_category,Total Sales
4,2019-02-08,373-73-7910,Naivas Juja City Mall,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,10:37,Ewallet,604.17,4.761905,30.2085,5.3,Detractor,604.17
10,2019-02-06,351-62-0822,Naivas Spur Mall,Member,Female,Fashion accessories,14.48,4,2.8960,60.8160,18:07,Ewallet,57.92,4.761905,2.8960,4.5,Detractor,57.92
13,2019-02-07,252-56-2699,Naivas Juja City Mall,Normal,Male,Food and beverages,43.19,10,21.5950,453.4950,16:48,Ewallet,431.90,4.761905,21.5950,8.2,Promoter,431.90
26,2019-02-08,649-29-6775,Naivas Spur Mall,Normal,Male,Fashion accessories,33.52,1,1.6760,35.1960,15:31,Cash,33.52,4.761905,1.6760,6.7,Passive,33.52
34,2019-02-06,183-56-6882,Naivas Mountain View,Member,Female,Food and beverages,99.42,4,19.8840,417.5640,10:42,Ewallet,397.68,4.761905,19.8840,7.5,Passive,397.68
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
961,2019-02-05,324-92-3863,Naivas Juja City Mall,Member,Male,Electronic accessories,20.89,2,2.0890,43.8690,18:45,Cash,41.78,4.761905,2.0890,9.8,Promoter,41.78
974,2019-02-07,744-82-9138,Naivas Mountain View,Normal,Male,Fashion accessories,86.13,2,8.6130,180.8730,17:59,Cash,172.26,4.761905,8.6130,8.2,Promoter,172.26
979,2019-02-04,151-33-7434,Naivas Spur Mall,Normal,Female,Food and beverages,67.77,1,3.3885,71.1585,20:43,Credit card,67.77,4.761905,3.3885,6.5,Passive,67.77
985,2019-02-07,374-38-5555,Naivas Spur Mall,Normal,Female,Fashion accessories,63.71,5,15.9275,334.4775,19:30,Ewallet,318.55,4.761905,15.9275,8.5,Promoter,318.55
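The same kind of filter can be written with explicit `pd.Timestamp` bounds or with `Series.between()`. The sketch below uses four toy dates to show how the two differ at the upper boundary.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Date": pd.to_datetime(["2019-01-29", "2019-02-01", "2019-02-09", "2019-02-10"])}
)

start, end = pd.Timestamp("2019-02-01"), pd.Timestamp("2019-02-10")

# Same condition as above: start inclusive, end exclusive
mask = (toy["Date"] >= start) & (toy["Date"] < end)
in_range = toy[mask]

# between() includes both endpoints by default, so 2019-02-10 is kept here
inclusive = toy[toy["Date"].between(start, end)]

print(len(in_range), len(inclusive))
```

Which form to use depends on whether the end date itself should be part of the result.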


## Summary

In this lesson, we explored the distinctions between pandas data types, such as Series and DataFrames, and native Python data types like dictionaries and lists. We then discussed techniques for altering the values within a pandas Series, modifying the columns of a pandas DataFrame, and adjusting the DataFrame index. Additionally, we examined how pandas handles datetime data.