# Lesson 3D: Data Transformation

## Learning Objectives:
1.  Convert data types within a column.
2.  Ensure consistent formatting of datetime and categorical values.
3.  Set up range checks on a column.



## ▶️: Load & Explore the Dataset  
We will get started by loading pandas and the restaurant_transaction_processed_01.csv file (from lesson 3C).




In [2]:
import pandas as pd
df = pd.read_csv("restaurant_transaction_processed_01.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 989 entries, 0 to 988
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Customer_ID       989 non-null    int64  
 1   Food_Item         989 non-null    object 
 2   Category          989 non-null    object 
 3   Date_of_Visit     989 non-null    object 
 4   Time              989 non-null    object 
 5   Weather           989 non-null    object 
 6   Price             989 non-null    float64
 7   Weekend           989 non-null    object 
 8   Public_Holiday    989 non-null    object 
 9   Year              989 non-null    int64  
 10  Month             989 non-null    int64  
 11  Day               989 non-null    int64  
 12  Day_of_Week       989 non-null    object 
 13  Discounted_Price  989 non-null    float64
 14  Category_Encoded  989 non-null    int64  
dtypes: float64(2), int64(5), object(8)
memory usage: 116.0+ KB


##  1️⃣: Converting data types
Let's check all the data types of the columns in this table.


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 989 entries, 0 to 988
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Customer_ID       989 non-null    int64  
 1   Food_Item         989 non-null    object 
 2   Category          989 non-null    object 
 3   Date_of_Visit     989 non-null    object 
 4   Time              989 non-null    object 
 5   Weather           989 non-null    object 
 6   Price             989 non-null    float64
 7   Weekend           989 non-null    object 
 8   Public_Holiday    989 non-null    object 
 9   Year              989 non-null    int64  
 10  Month             989 non-null    int64  
 11  Day               989 non-null    int64  
 12  Day_of_Week       989 non-null    object 
 13  Discounted_Price  989 non-null    float64
 14  Category_Encoded  989 non-null    int64  
dtypes: float64(2), int64(5), object(8)
memory usage: 116.0+ KB


We can convert between data types using the 'astype()' method.
- astype(int) - converts to integer
- astype(float) - converts to float
- astype(str) - converts to string

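A good practice when converting data types is to store the result in a new column, so the original values stay available for comparison.

Let's try to convert 'Price' into an integer. Note that 'astype()' returns a new Series and does not modify the column in place, so in the cell below the converted result is discarded and 'Price' itself stays float64.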

In [11]:
df['Price'].astype(int)
df['Price'].head()

0    14.88
1     6.72
2    14.27
3    14.69
4     8.84
Name: Price, dtype: float64

Let's check with 'info()' to verify the data type. 'Price' is still float64 because the converted Series was never assigned to a column.

In [8]:
df['Price'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 989 entries, 0 to 988
Series name: Price
Non-Null Count  Dtype  
--------------  -----  
989 non-null    float64
dtypes: float64(1)
memory usage: 7.9 KB
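
Because 'astype()' returns a new Series rather than changing the column in place, the converted values only persist when they are assigned. The sketch below uses a small made-up DataFrame to show the conversion stored in new columns. The sample values and the column names 'Price_Int' and 'Price_Rounded' are assumptions for illustration, not part of the restaurant dataset.

```python
import pandas as pd

# Toy data standing in for the 'Price' column (illustrative values only)
df_demo = pd.DataFrame({"Price": [14.88, 6.72, 14.27]})

# astype() returns a new Series, so assign it to keep the result.
# Converting float to int drops the decimals (14.88 becomes 14).
df_demo["Price_Int"] = df_demo["Price"].astype(int)

# Round first if the nearest whole number is wanted instead of truncation
df_demo["Price_Rounded"] = df_demo["Price"].round().astype(int)

print(df_demo.dtypes)
print(df_demo)
```

Keep in mind that 'astype(int)' raises an error if the column contains missing values, so those need to be handled before converting.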


## 2️⃣: Clean and Format Dates
In lesson 3C, you learned how to parse dates with datetime and extract their parts into new columns. Here, you will also learn how to standardize the way dates are displayed.

Note: dayfirst=True tells pandas to read ambiguous dates as DD-MM-YYYY instead of the default MM-DD-YYYY. Dates already in YYYY-MM-DD format put the year first and are unambiguous, so dayfirst is not needed for them.









In [19]:
# Date_of_Visit is already in YYYY-MM-DD format, so pandas parses it
# unambiguously and dayfirst is not needed here.
# For day-first input such as DD-MM-YYYY, pass dayfirst=True to pd.to_datetime().
# errors='coerce' turns unparseable values into NaT instead of raising an error.
# strftime() returns text, so Default_Date holds MM-DD-YYYY strings, not datetimes.
df['Default_Date'] = pd.to_datetime(df['Date_of_Visit'], errors='coerce').dt.strftime('%m-%d-%Y')

# The original column is unchanged; df['Default_Date'].head() shows the new format.
df['Date_of_Visit'].head()


0    2023-03-24
1    2023-03-19
2    2023-03-24
3    2023-03-05
4    2023-03-29
Name: Date_of_Visit, dtype: object
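
To make the effect of dayfirst and strftime concrete, here is a minimal sketch on a few made-up date strings (the sample values are assumptions for illustration, not taken from the dataset). It parses day-first text, coerces an invalid entry to NaT, and then reformats the result as MM-DD-YYYY strings.

```python
import pandas as pd

# Made-up day-first strings (DD-MM-YYYY) plus one invalid entry
raw = pd.Series(["03-04-2023", "25-12-2023", "not a date"])

# dayfirst=True reads 03-04-2023 as 3 April rather than 4 March;
# errors="coerce" converts the invalid entry to NaT instead of raising
parsed = pd.to_datetime(raw, dayfirst=True, errors="coerce")

# strftime() turns the datetimes back into text in the chosen layout
formatted = parsed.dt.strftime("%m-%d-%Y")

print(parsed)
print(formatted)
```

The parsed Series has a datetime dtype and supports date arithmetic, while the formatted Series is plain text meant for display.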

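## 3️⃣: Formatting Categorical Values

Inconsistent text values in categorical columns can cause errors in data analysis, because values that differ only in letter case or spacing are counted as separate categories.

For example, "Sunny", "sunny", and "SUNNY" should be normalized to a single form so they are treated as the same category.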


In [20]:
df['Weather'] = df['Weather'].str.lower().str.strip()
df['Weather'].head()

0      sunny
1    raining
2      sunny
3      sunny
4      sunny
Name: Weather, dtype: object
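
The effect of this cleanup is easiest to see by counting categories before and after. The following sketch uses a made-up Series (the sample values are assumptions for illustration) containing the same labels with different casing and stray spaces.

```python
import pandas as pd

# Made-up labels with inconsistent case and whitespace
weather = pd.Series(["Sunny", " sunny", "SUNNY ", "Raining", "raining"])

# Before cleaning, every variant counts as its own category
n_before = weather.nunique()

# Strip surrounding spaces and lower-case the text
cleaned = weather.str.strip().str.lower()
n_after = cleaned.nunique()

print(n_before, n_after)
print(cleaned.value_counts())
```

Here the five raw variants collapse into two categories, "sunny" and "raining".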

Here is another example where we shorten the day names to their first three letters (e.g., 'Friday' into 'Fri').

In [28]:
df['Day_of_Week'] = df['Day_of_Week'].str.title().str.strip().str[:3]
# unique is a method: call df['Day_of_Week'].unique() with parentheses to list the distinct values
df['Day_of_Week'].head()

0    Fri
1    Sun
2    Fri
3    Sun
4    Wed
Name: Day_of_Week, dtype: object
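
A quick way to confirm the abbreviations is to list the distinct values with 'unique()', which has to be called with parentheses because it is a method. The sketch below applies the same chain of string operations to a made-up Series (the sample values are assumptions for illustration).

```python
import pandas as pd

# Made-up day names with inconsistent case and whitespace
days = pd.Series(["friday", " Sunday ", "FRIDAY", "wednesday"])

# Title-case, strip spaces, then keep the first three characters.
# Stripping must happen before slicing, or a leading space would
# take up one of the three characters.
short_days = days.str.title().str.strip().str[:3]

# unique() returns the distinct values in order of first appearance
distinct_days = short_days.unique().tolist()

print(distinct_days)
```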

## 4️⃣: Validating data ranges

Sometimes we need to ensure that the values in a column fall within a certain range.

For example, values in 'Price' should not be missing, zero, or negative. If a value is out of range, we replace it with the median price.


In [40]:
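# Compute the median once, from valid prices only, so invalid values do not skew it
median_price = df.loc[df['Price'] > 0, 'Price'].median()
# Keep valid prices; replace missing or non-positive values with the median
df['Price'] = df['Price'].where(df['Price'] > 0, median_price)
df['Price'].head()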

0    14.88
1     6.72
2    14.27
3    14.69
4     8.84
Name: Price, dtype: float64
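
Since the first rows of the dataset are all valid, the output above does not show any replacement taking place. The sketch below runs a range check on a made-up Series that does contain invalid entries (the sample values are assumptions for illustration). It counts the invalid values, then substitutes the median of the valid ones.

```python
import pandas as pd

# Made-up prices with a negative value, a missing value, and a zero
prices = pd.Series([14.88, -3.0, float("nan"), 6.72, 0.0], name="Price")

# A price is valid when it is present and strictly positive
is_valid = prices.notna() & (prices > 0)
n_invalid = int((~is_valid).sum())

# Median of the valid prices only, so bad values do not skew the replacement
median_price = prices[is_valid].median()

# where() keeps values where the condition is True and substitutes elsewhere
cleaned = prices.where(is_valid, median_price)

print(n_invalid)
print(cleaned)
```

Counting the invalid rows before replacing them gives a simple record of how much of the column was changed.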

## 5️⃣: Export the Cleaned Dataset  

After applying the transformations, we save the cleaned dataset for further analysis.  


In [42]:
# index=False avoids writing the row index as an extra unnamed column
df.to_csv("restaurant_transaction_processed_02.csv", index=False)
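
The index argument of 'to_csv()' controls whether the row index is written to the file. The sketch below compares both settings using an in-memory buffer instead of a real file. The tiny DataFrame and its values are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for the cleaned dataset
df_demo = pd.DataFrame({"Food_Item": ["Pasta", "Salad"], "Price": [14.88, 6.72]})

# Default behaviour: the row index is written as an extra, unnamed column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

print(cols_with_index)
print(reloaded.columns.tolist())
```

Reloading the exported file and checking its columns is a cheap sanity check before moving on to the next lesson.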

## 🎯 Summary  

In this guided practice, we:  
✔ **Validated data types** to ensure correct formats for numerical and date columns  
✔ **Performed range checks** to identify and correct invalid values  
✔ **Standardized and cleaned data** by fixing categorical inconsistencies and handling missing values  

---  

