# **Part 1**

In [10]:
import pandas as pd

# Load the dataset
df = pd.read_csv('dataset_task1.csv')

# Convert 'age' to numeric, forcing errors to NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Cap 'age' between 0 and 100
df['age'] = df['age'].clip(lower=0, upper=100)

# Standardize city and name columns by converting to title case
df['name'] = df['name'].str.title()
df['city'] = df['city'].str.title()

# Standardize string lengths
df['city'] = df['city'].str[:15]
df['name'] = df['name'].str[:15]

# Convert 'date' to a consistent format and handle missing dates
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Fill missing dates with a constant timestamp
# (assign the result back; inplace=True on a column selection is unreliable in recent pandas)
df['date'] = df['date'].fillna(pd.Timestamp('2021-06-15'))

# Fill missing values in 'income' and 'temperature_F' with the column median
df['income'] = df['income'].fillna(df['income'].median())
df['temperature_F'] = df['temperature_F'].fillna(df['temperature_F'].median())

# Convert temperature from Fahrenheit to Celsius
df['temperature_C'] = (df['temperature_F'] - 32) * 5.0 / 9.0

# Fill remaining missing values in 'age' and 'temperature_C' with the column median
df['age'] = df['age'].fillna(df['age'].median())
df['temperature_C'] = df['temperature_C'].fillna(df['temperature_C'].median())

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# View the final dataset
df.head()


Unnamed: 0,name,age,city,temperature_F,income,date,temperature_C
0,Alice,25.0,New York,95.0,55000.0,2021-06-15,35.0
1,Bob,35.0,Los Angeles,102.0,60000.0,2021-06-15,38.888889
2,Carol,35.0,New York,89.0,57500.0,2021-06-15,31.666667
3,Bob,40.0,Chicago,78.0,70000.0,2021-06-15,25.555556
4,Alice,100.0,Chicago,84.0,45000.0,2021-06-15,28.888889


**Task 1: Data Type Constraints**

To convert columns with incorrect data types (such as the string "thirty" in the age column) to numeric, I used the pd.to_numeric() function. With the errors='coerce' argument, any value that cannot be parsed as a number is replaced with NaN instead of raising an error.
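As a small illustration of the coercion described above, the sketch below applies pd.to_numeric() to a handful of made-up values. The sample entries are assumptions for demonstration only; the real age column may contain different ones.

```python
import pandas as pd

# Hypothetical sample values, not taken from dataset_task1.csv
ages = pd.Series(["25", "thirty", "40", None, "35.5"])

# errors='coerce' replaces anything that cannot be parsed with NaN
numeric_ages = pd.to_numeric(ages, errors="coerce")

print(numeric_ages.tolist())  # [25.0, nan, 40.0, nan, 35.5]
print(numeric_ages.dtype)     # float64
```

Counting `numeric_ages.isna().sum()` before and after the conversion is one way to see how many entries were coerced.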

**Task 2: Data Range Constraints**

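To handle out-of-range values in the age column, I used clip(lower=0, upper=100), which caps every value below 0 at 0 and every value above 100 at 100. Note that clip() does not flag outliers; it only limits them to the allowed range.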
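Because clip() changes values silently, it can help to count the out-of-range entries first. The following is a minimal sketch on made-up ages; the values and the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample values, including two out-of-range ages and one missing value
ages = pd.Series([25, -3, 40, 150, None])

# Boolean mask of values outside the allowed range (NaN compares as False)
out_of_range = (ages < 0) | (ages > 100)
n_changed = int(out_of_range.sum())

# Cap values to [0, 100]; NaN is left untouched
clipped = ages.clip(lower=0, upper=100)

print(n_changed)         # 2
print(clipped.tolist())  # [25.0, 0.0, 40.0, 100.0, nan]
```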

**Task 3: Uniqueness Constraints**

To remove duplicate rows, I used the drop_duplicates() function, which by default compares all columns and keeps the first occurrence of each repeated row.
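One caveat worth checking: the displayed output includes an `Unnamed: 0` column, which looks like a saved row index. If that column is unique per row (an assumption, since the raw file is not shown here), comparing all columns would never find a duplicate. The sketch below uses a made-up frame with a hypothetical `row_id` column to show the difference the `subset` argument makes.

```python
import pandas as pd

# Hypothetical data: 'row_id' stands in for a unique index-like column
df_demo = pd.DataFrame({
    "row_id": [0, 1, 2],
    "name": ["Alice", "Bob", "Alice"],
    "age": [25, 35, 25],
})

# Comparing all columns: nothing is dropped because row_id differs everywhere
all_cols = df_demo.drop_duplicates()

# Comparing only the content columns: the second "Alice, 25" row is dropped
by_content = df_demo.drop_duplicates(subset=["name", "age"])

print(len(all_cols), len(by_content))  # 3 2
```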

**Task 4: Categorical Data Problems**

For inconsistent text in the categorical columns (such as different capitalization of city names), I used the str.title() method to convert name and city to title case, and then truncated both columns to at most 15 characters.
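The effect of title-casing can be seen on a few made-up city strings. The call to str.strip() is an addition for illustration and is not part of the notebook's code; it handles stray whitespace, which str.title() alone would leave in place.

```python
import pandas as pd

# Hypothetical raw values with mixed capitalization and stray whitespace
cities = pd.Series(["new york", "NEW YORK", " chicago ", "Los angeles"])

# Strip whitespace (illustrative extra step), then convert to title case
standardized = cities.str.strip().str.title()

print(standardized.tolist())   # ['New York', 'New York', 'Chicago', 'Los Angeles']
print(standardized.nunique())  # 3
```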

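**Task 5: Numeric Data Problems**

To convert the temperature from Fahrenheit to Celsius, I applied the standard conversion formula to the temperature_F column and stored the result in a new temperature_C column:

$$\text{Celsius} = (\text{Fahrenheit} - 32) \times \frac{5}{9}$$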
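The formula can be checked against well-known reference points. This is a minimal sketch; the helper name `fahrenheit_to_celsius` is introduced here for illustration and does not appear in the notebook.

```python
import pandas as pd

def fahrenheit_to_celsius(temps_f: pd.Series) -> pd.Series:
    """Convert a Series of Fahrenheit temperatures to Celsius."""
    return (temps_f - 32) * 5.0 / 9.0

# Freezing point, the first row of the output table (95 °F), and boiling point
temps_f = pd.Series([32.0, 95.0, 212.0])
temps_c = fahrenheit_to_celsius(temps_f)

print(temps_c.round(2).tolist())  # [0.0, 35.0, 100.0]
```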

**Task 6: Date Constraints**

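To ensure consistency in the date column, I used the pd.to_datetime() function with errors='coerce'. This converts the column to the pandas datetime type (displayed as YYYY-MM-DD) and turns entries that cannot be parsed into NaT. Missing dates are then filled with the constant 2021-06-15.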
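The two date steps, parsing and then filling, can be sketched on a few made-up entries. The sample strings are assumptions; the fill date 2021-06-15 is the constant used in the notebook.

```python
import pandas as pd

# Hypothetical raw values: two valid dates, one invalid string, one missing entry
raw_dates = pd.Series(["2021-06-15", "not a date", None, "2021-07-01"])

# Unparseable or missing entries become NaT
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Fill NaT with a constant timestamp, keeping the datetime dtype
filled = parsed.fillna(pd.Timestamp("2021-06-15"))

print(int(parsed.isna().sum()))  # 2
print(filled.dt.strftime("%Y-%m-%d").tolist())
```

Filling with a `pd.Timestamp` rather than a plain string keeps the column's datetime dtype explicit.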

**Task 7: Missing Data**

To handle missing values in the numerical columns (age, income, temperature_F, temperature_C), I used the fillna() method. For each column, I imputed missing values with that column's median, which is less sensitive to extreme values than the mean.
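Median imputation can be sketched on a small made-up frame. The sketch assigns the result of fillna() back to the column instead of using inplace=True on a column selection, since recent pandas versions warn about the latter pattern; the data values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data with one missing value per column
df_demo = pd.DataFrame({
    "income": [55000.0, None, 60000.0, 45000.0],
    "temperature_F": [95.0, 102.0, None, 78.0],
})

# Replace missing values in each column with that column's median
for col in ["income", "temperature_F"]:
    df_demo[col] = df_demo[col].fillna(df_demo[col].median())

print(df_demo["income"].tolist())         # [55000.0, 55000.0, 60000.0, 45000.0]
print(df_demo["temperature_F"].tolist())  # [95.0, 102.0, 95.0, 78.0]
```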

# **Part 2**

In [12]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

# Load the dataset from Excel
file_path = 'Raisin_Dataset.xlsx'
df = pd.read_excel(file_path)

df.head()

# Prepare Features (X) and Target (y)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Encode the target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Random Splitting: split the dataset randomly into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Stratified Sampling
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Cross-Validation

# Initialize the model
clf = RandomForestClassifier(random_state=42)

# Apply 5-fold cross-validation
cv = StratifiedKFold(n_splits=5)
cv_scores = cross_val_score(clf, X, y, cv=cv)

# Print cross-validation results
print("Cross-Validation Scores: ", cv_scores)
print("Average CV Score: ", cv_scores.mean())


Cross-Validation Scores:  [0.88333333 0.9        0.85555556 0.81111111 0.85      ]
Average CV Score:  0.86


**Brief explanation of the code above**

We used the train_test_split() function from scikit-learn to split the dataset randomly into 70% training and 30% test data; random_state=42 makes the split reproducible. For the stratified split, we passed stratify=y to train_test_split() so that the class proportions in the training and test sets match those of the full dataset.

For cross-validation, we used the cross_val_score() function with a 5-fold StratifiedKFold, which keeps the class distribution similar in every fold. In each round, the RandomForestClassifier is trained on four folds and scored on the remaining one. We report the accuracy for each fold and the average accuracy (0.86). The two hold-out splits are created in this cell but are not used to fit or evaluate a model.
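To make the difference between the two splitting strategies concrete, the sketch below compares the class share in the test set for a random and a stratified split, and then fits a model on the stratified hold-out split. It runs on synthetic stand-in data (an assumption, not the Raisin dataset), and the names `X_demo`, `y_demo`, and `clf_demo` are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 7 features, 20% of samples in class 1
rng = np.random.default_rng(42)
y_demo = np.array([0] * 160 + [1] * 40)
X_demo = rng.normal(size=(200, 7)) + y_demo[:, None]

# Random split: the class share in the test set can drift from the overall 20%
_, _, _, y_te_rand = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Stratified split: the class share in the test set matches the overall share
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

rand_share = y_te_rand.mean()
strat_share = y_te.mean()
print(f"Class-1 share, random test set:     {rand_share:.3f}")
print(f"Class-1 share, stratified test set: {strat_share:.3f}")

# Fit on the stratified training set and score on the held-out test set
clf_demo = RandomForestClassifier(random_state=42)
clf_demo.fit(X_tr, y_tr)
holdout_accuracy = clf_demo.score(X_te, y_te)
print(f"Hold-out accuracy: {holdout_accuracy:.3f}")
```

The same pattern would apply to the `X_train_strat` and `y_train_strat` variables from the cell above, giving a single hold-out estimate to set beside the cross-validation average.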

Link to the dataset: https://archive.ics.uci.edu/dataset/850/raisin