# data_cleaning
* The first line imports the pandas library.
* The second line reads the Chipotle TSV (tab-separated values) file into a pandas DataFrame using pd.read_csv. The sep='\t' parameter specifies that the values are separated by tabs.


In [None]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Copy of chipotle.tsv', sep='\t')

**Missing Values:**
*  Use df.isnull().sum() to check for missing values in each column.
* Handle missing values based on the context. For example, you might drop rows with missing values or fill them with appropriate values.

This checks for missing values in each column and stores the count of missing values in the missing_values variable.

In [None]:
missing_values = df.isnull().sum()
print(missing_values)
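
As a rough illustration of the two handling options mentioned above, the sketch below uses a small made-up DataFrame (not the real file) and a hypothetical placeholder string. Which option is appropriate depends on what a missing value means in the column.

```python
import pandas as pd

# Toy data standing in for the real file; the values are made up for illustration.
sample = pd.DataFrame({
    "item_name": ["Chips", "Chicken Bowl", "Canned Soda"],
    "choice_description": [None, "[Fresh Tomato Salsa]", "[Sprite]"],
})

# Option 1: drop every row that has at least one missing value.
dropped = sample.dropna()

# Option 2: keep all rows and fill the gaps with a placeholder
# ("No choice" is a hypothetical label, not something from the dataset).
filled = sample.fillna({"choice_description": "No choice"})

print(len(dropped))
print(filled["choice_description"].tolist())
```

Dropping is simplest but discards rows that may be valid, for example items that never have a choice description.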

**Data Types:**
*  Use df.dtypes to verify the data types of each column.
* Adjust data types if needed using astype().


This line removes the dollar sign from the item_price column and converts it to a float data type.

In [None]:
# regex=False treats '$' as a literal character, not as a regex end-of-string anchor
df['item_price'] = df['item_price'].str.replace('$', '', regex=False).astype(float)
df['item_price']
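
The dtype check and conversion described above can be tried on a small made-up frame first. This is a minimal sketch, assuming the raw columns arrive as strings; the values are invented for illustration.

```python
import pandas as pd

# Toy frame with string columns, mimicking how the raw file may load.
sample = pd.DataFrame({
    "quantity": ["1", "2"],
    "item_price": ["$2.39 ", "$16.98 "],
})
print(sample.dtypes)  # inspect the types before converting

# Convert quantity to an integer type.
sample["quantity"] = sample["quantity"].astype(int)

# Strip the literal dollar sign and surrounding whitespace, then convert to float.
sample["item_price"] = (
    sample["item_price"]
    .str.replace("$", "", regex=False)
    .str.strip()
    .astype(float)
)
print(sample.dtypes)  # inspect the types after converting
```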

**Duplicated Entries:**
*  Use df.duplicated().sum() to identify duplicated entries.
*  Decide on an appropriate action, such as dropping duplicates using df.drop_duplicates().

This counts the fully duplicated rows, stores the count in the duplicated_entries variable, and then drops the duplicates from the DataFrame.

In [None]:
duplicated_entries = df.duplicated().sum()
df = df.drop_duplicates()
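
Before dropping anything, it can help to look at which rows are flagged. The sketch below uses made-up rows; in an orders table, identical rows may be legitimate repeat line items rather than errors, so whether to drop them is a judgment call.

```python
import pandas as pd

# Toy data: the second row is an exact copy of the first.
sample = pd.DataFrame({
    "order_id": [1, 1, 2, 2],
    "item_name": ["Chips", "Chips", "Chicken Bowl", "Chicken Bowl"],
    "quantity": [1, 1, 2, 1],
})

# keep="first" marks every repeat after the first occurrence as a duplicate.
dupe_mask = sample.duplicated(keep="first")
n_dupes = int(dupe_mask.sum())

# Drop the flagged rows and renumber the index.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```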


**Quantity and Item Price:**
*  Check for inconsistencies or anomalies in the Quantity and Item Price columns.
*  Handle any issues, such as converting data types or correcting values.

In [None]:
# Coerce quantity to a numeric type; values that cannot be parsed become NaN
df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce')


In [None]:
# Fill any missing prices with the column mean
df['item_price'] = df['item_price'].fillna(df['item_price'].mean())


**Handling Special Characters:**
* Check for and address special characters in text-based columns.
* Use string manipulation functions like str.replace().

In [None]:
# Raw string plus regex=True so the pattern is applied as a regular expression
df['item_name'] = df['item_name'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)
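
To see what this kind of pattern does, here is a small sketch on made-up item names. The second replace, which collapses repeated spaces, is an extra step added for illustration and is not part of the notebook's cleaning.

```python
import pandas as pd

# Made-up names containing punctuation.
names = pd.Series(["Chips & Guacamole", "6 Pack Soft Drink!", "Steak Burrito"])

# Remove everything except letters, digits, and whitespace.
cleaned = names.str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)

# Removing a symbol can leave a double space behind, so collapse and trim.
cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

print(cleaned.tolist())
```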

**Order Id Integrity:**
* Cross-reference the Order ID column for irregularities or patterns.
* Ensure it follows the expected format or make necessary adjustments.

In [None]:
unique_order_ids = df['order_id'].unique()
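
Listing the unique IDs is only a starting point. A few simple checks, sketched here on made-up IDs, could look for non-positive values, out-of-order rows, and gaps in the sequence. Whether gaps are actually a problem depends on how the IDs were issued, which is an assumption here.

```python
import pandas as pd

# Made-up order IDs with a gap at 3.
sample = pd.DataFrame({"order_id": [1, 1, 2, 4, 4]})
ids = sample["order_id"]

# Every ID should be a positive number.
all_positive = bool((ids > 0).all())

# Rows of the same order should appear together, in non-decreasing ID order.
is_sorted = bool(ids.is_monotonic_increasing)

# IDs between the minimum and maximum that never occur.
expected = set(range(int(ids.min()), int(ids.max()) + 1))
missing_ids = sorted(expected - set(ids.unique().tolist()))

print(all_positive, is_sorted, missing_ids)
```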

**Item Name Standardization:**

* Standardize the Item Name column by unifying variations.
* Consider using string methods or mapping to create a standardized version.



In [None]:
df['Standardized Item Name'] = df['item_name'].str.lower()
df['Standardized Item Name']
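
Lowercasing alone does not merge names that are spelled differently. A minimal sketch of the mapping approach follows; the names and the name_map dictionary are hypothetical and would need to be built from the variants actually found in the data.

```python
import pandas as pd

# Made-up names: two spellings of the same item plus one other item.
sample = pd.DataFrame({
    "item_name": [
        "Chips and Fresh Tomato Salsa",
        "chips and tomato salsa ",
        "Chicken Bowl",
    ]
})

# Hypothetical mapping from a known variant to its canonical form.
name_map = {"chips and tomato salsa": "chips and fresh tomato salsa"}

# Normalize whitespace and case first, then apply the mapping.
cleaned = sample["item_name"].str.strip().str.lower()
sample["standardized_item_name"] = cleaned.replace(name_map)

print(sample["standardized_item_name"].unique())
```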

**Quantity and Price Relationships:**
* Investigate relationships between Quantity and Item Price.
* Identify and handle any outliers or inconsistencies.

In [None]:
# Example: Check for negative prices or quantities
negative_values = df[(df['quantity'] < 0) | (df['item_price'] < 0)]
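
Negative values are only one kind of anomaly. One common way to flag unusually high or low prices is the interquartile range (IQR) rule, sketched below on made-up prices. The 1.5 multiplier is a convention, not something the dataset prescribes.

```python
import pandas as pd

# Made-up prices with one very low and one very high value.
prices = pd.Series([2.39, 8.75, 9.25, 11.25, 11.75, 44.25], name="item_price")

# Quartiles and interquartile range.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

print(outliers)
```

Flagged values are candidates for review, not automatic deletions: a high price may simply reflect a large quantity.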


**Data Integrity Check:**

* Ensure quantities and prices align with corresponding items and descriptions.
* Cross-check against known values or patterns.

In [None]:
# Aggregate quantity and price per item and choice; dropna=False keeps items with no choice_description
integrity_check_modified = df.groupby(['item_name', 'choice_description'], dropna=False).agg({
    'quantity': ['sum', 'mean'],
    'item_price': ['sum', 'mean']
}).reset_index()
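
One way to cross-check prices against items is to derive a per-unit price and see whether it stays constant for each item. The sketch below uses made-up rows and assumes that item_price is the total for the whole line (price times quantity); that assumption should be verified against the real data.

```python
import pandas as pd

# Made-up rows: Chicken Bowl is consistent, Chips has two different unit prices.
sample = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Chicken Bowl", "Chips", "Chips"],
    "quantity": [1, 2, 1, 1],
    "item_price": [8.75, 17.50, 2.15, 2.95],
})

# Assumption: item_price is the line total, so divide by quantity.
sample["unit_price"] = (sample["item_price"] / sample["quantity"]).round(2)

# Count how many distinct unit prices each item has.
price_variants = sample.groupby("item_name")["unit_price"].nunique()

# Items with more than one unit price deserve a closer look.
inconsistent = price_variants[price_variants > 1].index.tolist()

print(inconsistent)
```

More than one unit price is not necessarily an error, since different choices for the same item can be priced differently.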


**Converting to CSV:**

* If needed, use df.to_csv() to save the cleaned dataset as a CSV file.

In [None]:
df.to_csv('cleaned_chipotle.csv', index=False)


**Handling Categorical Data:**

* For categorical columns, consider encoding or transforming them using techniques like one-hot encoding.

In [None]:
# Example: One-hot encoding 'Item Name'
df_encoded = pd.get_dummies(df, columns=['item_name'], prefix='item')


**Consistent Quantity and Price Units:**

* Ensure consistency in units for Quantity and Item Price.
* Make conversions or adjustments if necessary.

In [None]:
# Example: Convert quantity to integer (this raises an error if any NaN values remain)
df['quantity'] = df['quantity'].astype(int)
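
If a column may still contain missing values after coercion, a plain integer conversion fails. A possible alternative, sketched here on made-up values, is the pandas nullable integer dtype "Int64", which keeps missing entries as <NA>.

```python
import pandas as pd

# Made-up raw quantities, including one unparseable value and one missing value.
raw = pd.Series(["1", "2", "two", None], name="quantity")

# Unparseable and missing entries become NaN.
numeric = pd.to_numeric(raw, errors="coerce")
n_invalid = int(numeric.isna().sum())

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>.
as_nullable_int = numeric.astype("Int64")

print(n_invalid)
print(as_nullable_int)
```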




---


**Final cleaned data**
---



In [None]:
# Display the final cleaned data in tabular form
df