# Lesson 3D: Data Transformation

## Learning Objectives:
1.  Convert data types within a column.
2.  Ensure consistent formatting of datetime and categorical values.
3.  Set up range checks on a column.



## ▶️: Load & Explore the Dataset  
We will get started by loading pandas and the restaurant_transaction_processed_01.csv file (from lesson 3C).




In [15]:
import pandas as pd

file_path = "restaurant_transaction_processed_01.csv"
df = pd.read_csv(file_path)

df.head()

Unnamed: 0,Customer_ID,Food_Item,Category,Date_of_Visit,Time,Weather,Price,Weekend,Public_Holiday,Year,Month,Day,Day_of_Week,Discounted_Price,Category_Encoded
0,1075,Smoothie,Cold,2023-03-24,12:30,Sunny,14.88,No,No,2023,3,24,Friday,13.39,0
1,1030,Soup,Hot,2023-03-19,14:30,Raining,6.7176,Yes,No,2023,3,19,Sunday,6.05,1
2,1055,Ice Cream,Cold,2023-03-24,10:30,Sunny,14.27,No,No,2023,3,24,Friday,12.84,0
3,1058,Ice Cream,Cold,2023-03-05,22:00,Sunny,14.688,Yes,No,2023,3,5,Sunday,13.22,0
4,1084,Smoothie,Cold,2023-03-29,16:30,Sunny,8.844,No,Yes,2023,3,29,Wednesday,7.96,0


## 1️⃣: Converting data types
Let's check all the data types of the columns in this table.


In [14]:
print(df.dtypes)

Customer_ID           int64
Food_Item            object
Category             object
Date_of_Visit        object
Time                 object
Weather              object
Price               float64
Weekend              object
Public_Holiday       object
Year                  int64
Month                 int64
Day                   int64
Day_of_Week          object
Discounted_Price    float64
Category_Encoded      int64
dtype: object


We can convert between data types using the 'astype()' method.
- astype(int) - converts to integer
- astype(float) - converts to float
- astype(str) - converts to string

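The best practice when converting data types is to store the result in a new column, so the original values stay available for comparison.

Let's try converting 'Price' into an integer. Note that astype(int) drops the decimal part rather than rounding (14.88 becomes 14), and that for brevity this cell overwrites 'Price' instead of creating a new column.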

In [21]:
df["Price"] = df["Price"].astype(int)

print(df.dtypes)

Customer_ID           int64
Food_Item            object
Category             object
Date_of_Visit        object
Time                 object
Weather              object
Price                 int64
Weekend              object
Public_Holiday       object
Year                  int64
Month                 int64
Day                   int64
Day_of_Week          object
Discounted_Price    float64
Category_Encoded      int64
dtype: object
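
The truncation and the new-column practice described above can be seen on a small, made-up frame. This is only a minimal sketch: the frame 'prices' and the column names 'Price_Truncated' and 'Price_Rounded' are illustrative assumptions and are not part of the lesson dataset.

```python
import pandas as pd

# Hypothetical sample prices, chosen to resemble the first rows of the dataset
prices = pd.DataFrame({"Price": [14.88, 6.7176, 8.844]})

# astype(int) drops the decimal part; the original column is left untouched
prices["Price_Truncated"] = prices["Price"].astype(int)

# Rounding first gives the nearest whole number instead
prices["Price_Rounded"] = prices["Price"].round().astype(int)

print(prices)
```

Keeping the original 'Price' column makes it easy to compare both results and to undo the conversion if needed.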


Let's use 'df.info()' to verify the data types.

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 989 entries, 0 to 988
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Customer_ID       989 non-null    int64  
 1   Food_Item         989 non-null    object 
 2   Category          989 non-null    object 
 3   Date_of_Visit     989 non-null    object 
 4   Time              989 non-null    object 
 5   Weather           989 non-null    object 
 6   Price             989 non-null    int64  
 7   Weekend           989 non-null    object 
 8   Public_Holiday    989 non-null    object 
 9   Year              989 non-null    int64  
 10  Month             989 non-null    int64  
 11  Day               989 non-null    int64  
 12  Day_of_Week       989 non-null    object 
 13  Discounted_Price  989 non-null    float64
 14  Category_Encoded  989 non-null    int64  
dtypes: float64(1), int64(6), object(8)
memory usage: 116.0+ KB


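## 2️⃣: Clean and Format Dates
In lesson 3C, you learned how to parse dates with datetime and extract parts of them into new columns. Here, you will also learn how to standardize the way dates are displayed.

Note: dayfirst=True tells pandas to read ambiguous dates as DD-MM-YYYY instead of the default MM-DD-YYYY. The dates in this dataset are already in the unambiguous YYYY-MM-DD format, so the option does not change the result here, and pandas may emit a warning about it.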
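
To see what 'dayfirst' changes, here is a minimal sketch on a made-up date string that can be read two ways. The value "03-04-2023" and the variable names are assumptions for illustration only.

```python
import pandas as pd

ambiguous = "03-04-2023"  # hypothetical value: 4 March or 3 April?

# Default behaviour: the first number is treated as the month
month_first = pd.to_datetime(ambiguous)

# dayfirst=True: the first number is treated as the day
day_first = pd.to_datetime(ambiguous, dayfirst=True)

print(month_first.strftime("%Y-%m-%d"))  # 2023-03-04
print(day_first.strftime("%Y-%m-%d"))    # 2023-04-03
```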









In [47]:
# df['Date_of_Visit'].dtype
df['Default_Date'] = pd.to_datetime(df['Date_of_Visit'], errors='coerce', dayfirst=True).dt.strftime('%m-%d-%Y')
df.head()



Unnamed: 0,Customer_ID,Food_Item,Category,Date_of_Visit,Time,Weather,Price,Weekend,Public_Holiday,Year,Month,Day,Day_of_Week,Discounted_Price,Category_Encoded,Default_Date
0,1075,Smoothie,Cold,2023-03-24,12:30,Sunny,14,No,No,2023,3,24,Friday,13.39,0,03-24-2023
1,1030,Soup,Hot,2023-03-19,14:30,Raining,6,Yes,No,2023,3,19,Sunday,6.05,1,03-19-2023
2,1055,Ice Cream,Cold,2023-03-24,10:30,Sunny,14,No,No,2023,3,24,Friday,12.84,0,03-24-2023
3,1058,Ice Cream,Cold,2023-03-05,22:00,Sunny,14,Yes,No,2023,3,5,Sunday,13.22,0,03-05-2023
4,1084,Smoothie,Cold,2023-03-29,16:30,Sunny,8,No,Yes,2023,3,29,Wednesday,7.96,0,03-29-2023
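
The cell above combines two steps: parsing with errors='coerce' and formatting with 'strftime'. The sketch below separates them on a made-up series to show what happens to a value that cannot be parsed. The series 'raw' and the explicit 'format' argument are assumptions added for this illustration.

```python
import pandas as pd

# Hypothetical input with one invalid entry
raw = pd.Series(["2023-03-24", "not a date", "2023-03-05"])

# errors="coerce" turns unparseable values into NaT instead of raising an error
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# strftime returns text, so the result is no longer a datetime column
formatted = parsed.dt.strftime("%m-%d-%Y")

print(parsed.isna().sum())  # number of values that failed to parse
print(formatted)
```

Because the formatted column holds strings, it is best used for display; keep the parsed datetime values if you still need to sort or filter by date.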


## 3️⃣: Formatting Categorical Values

Inconsistent text values in categorical columns can cause errors in data analysis.

For example, "Sunny", "sunny", and "SUNNY" should all be stored as one consistent value across the dataset.


In [56]:
df.info()
df['Weather'].head()

df['Weather'] = df['Weather'].str.lower().str.strip()

df['Weather'].head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 989 entries, 0 to 988
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Customer_ID       989 non-null    int64  
 1   Food_Item         989 non-null    object 
 2   Category          989 non-null    object 
 3   Date_of_Visit     989 non-null    object 
 4   Time              989 non-null    object 
 5   Weather           989 non-null    object 
 6   Price             989 non-null    int64  
 7   Weekend           989 non-null    object 
 8   Public_Holiday    989 non-null    object 
 9   Year              989 non-null    int64  
 10  Month             989 non-null    int64  
 11  Day               989 non-null    int64  
 12  Day_of_Week       989 non-null    object 
 13  Discounted_Price  989 non-null    float64
 14  Category_Encoded  989 non-null    int64  
 15  Default_Date      989 non-null    object 
dtypes: float64(1), int64(6), object(9)
memory us

0      sunny
1    raining
2      sunny
3      sunny
4      sunny
Name: Weather, dtype: object
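
One way to confirm that the cleanup worked is to compare the distinct values before and after. This is a minimal sketch on a made-up series; the inconsistent spellings in 'weather' are assumptions and do not come from the lesson dataset.

```python
import pandas as pd

# Hypothetical column with mixed casing and stray spaces
weather = pd.Series(["Sunny", "sunny ", " SUNNY", "Raining", "raining"])

print(weather.nunique())  # 5 distinct spellings before cleaning

# Remove surrounding whitespace, then lowercase everything
cleaned = weather.str.strip().str.lower()

print(sorted(cleaned.unique()))  # ['raining', 'sunny']
```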

Here is another example, where we shorten the names of the days to their first 3 letters (e.g. 'Friday' into 'Fri').

In [60]:
df.head()
df['Day_of_Week'] = df['Day_of_Week'].str[:3]
df['Day_of_Week'].head(10)

0    Fri
1    Sun
2    Fri
3    Sun
4    Wed
5    Fri
6    Tue
7    Fri
8    Tue
9    Sat
Name: Day_of_Week, dtype: object
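
Slicing with str[:3] assumes every value is already a full, correctly spelled day name. As an alternative sketch, the abbreviation can be derived from the date itself, which avoids depending on the text column. The series 'dates' below is a made-up sample using three dates from the preview table.

```python
import pandas as pd

# Hypothetical sample of visit dates
dates = pd.to_datetime(pd.Series(["2023-03-24", "2023-03-19", "2023-03-29"]))

# day_name() returns the full English day name; keep the first three letters
short_days = dates.dt.day_name().str[:3]

print(short_days.tolist())  # ['Fri', 'Sun', 'Wed']
```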

## 4️⃣: Validating data ranges

There are times when we want to ensure that the values fall within a certain range.

For example, the values in 'Price' should be neither missing nor zero or negative. If a value is out of range, we replace it with the median price.  


In [50]:
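# Compute the median once instead of recomputing it for every row
median_price = df['Price'].median()
df['Price'] = df['Price'].apply(lambda x: x if pd.notnull(x) and x > 0 else median_price)
df[df['Price'] <= 0]  # should match no rows after the replacement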

df['Price'].head()

0    14
1     6
2    14
3    14
4     8
Name: Price, dtype: int64
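
Since the lesson data contains no invalid prices, the replacement is easier to see on a made-up series. The sketch below is a vectorized variant under one stated assumption: the median is taken over the valid prices only, which differs slightly from the cell above, where the median is computed over all non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one negative and one missing value
prices = pd.Series([14.0, -3.0, np.nan, 8.0, 6.0])

# A price is valid when it is present and greater than zero
valid = prices.notna() & (prices > 0)

# Median of the valid prices only (14, 8 and 6)
median_valid = prices[valid].median()

# Keep valid prices, replace everything else with the median
fixed = prices.where(valid, median_valid)

print(fixed.tolist())  # [14.0, 8.0, 8.0, 8.0, 6.0]
```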

## 5️⃣: Export the Cleaned Dataset  

After applying the transformations, we save the cleaned dataset for further analysis.  


In [61]:
df.to_csv("restaurant_transaction_processed_02.csv", index=False)
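
A CSV file stores plain text, so data types such as datetime are not preserved when the file is read back. The following sketch illustrates the round trip on a made-up frame, using an in-memory buffer instead of a real file; 'df_demo' and its columns are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row frame with a datetime column
df_demo = pd.DataFrame({
    "Visit": pd.to_datetime(["2023-03-24", "2023-03-19"]),
    "Price": [14, 6],
})

# Write to an in-memory buffer; index=False leaves the row index out of the file
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates asks read_csv to convert the column back to datetime
reloaded = pd.read_csv(buffer, parse_dates=["Visit"])

print(reloaded.dtypes)
```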

## 🎯 Summary  

In this guided practice, we:  
✔ **Converted and verified data types** to ensure correct formats for numerical and date columns  
✔ **Performed range checks** to identify and correct invalid values  
✔ **Standardized and cleaned data** by fixing categorical inconsistencies and handling missing values  

---  

