# Assignment 1

## Instructions
In the Scikit-learn documentation you'll find a large list of ways to classify data. Do a little scavenger hunt in these docs: your goal is to look for classification methods and match a dataset in this curriculum, a question you can ask of it, and a technique of classification. Create a spreadsheet or table in a .doc file and explain how the dataset would work with the classification algorithm.

## Classification Models and Type of Dataset


| Dataset              | Question (Target Variable)                                      | Classification Method | Why It Works                                                                 |
|----------------------|-----------------------------------------------------------------|------------------------|-------------------------------------------------------------------------------|
| Heart Disease        | How likely will a patient have heart disease? (Yes/No)          | Logistic Regression    | Binary classification: predicts probability of disease (0 = No, 1 = Yes). Interpretable and widely used in healthcare. |
| Salary Dataset       | Predict if a person’s salary is “High” or “Low” based on experience | Support Vector Machine (SVM) | Converts salary into categories (High/Low). SVM finds the best decision boundary. Works well with small datasets. |
| Pumpkin Dataset      | Predict if the average price in a given month is “Above Average” or “Below Average” | Random Forest         | Handles non-linear seasonal data. Robust, reduces overfitting compared to a single decision tree. |
| Cuisine Dataset      | Predict cuisine type according to ingredients (Multiclass)      | Naive Bayes            | Works well for text data like ingredients. Simple, fast, and effective for multiclass classification. |
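
As an illustration of the Salary row, here is a minimal sketch of turning a continuous salary into "High"/"Low" labels and fitting an SVM. The data is randomly generated toy data, not the curriculum's salary dataset, and the median split and linear kernel are assumptions chosen for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: years of experience and a noisy salary.
rng = np.random.default_rng(0)
years = rng.uniform(0, 20, size=200)
salary = 30_000 + 4_000 * years + rng.normal(0, 5_000, size=200)

# Turn the continuous target into two classes by splitting at the median.
labels = np.where(salary > np.median(salary), "High", "Low")

# Scale the single feature, then fit a linear-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(years.reshape(-1, 1), labels)

# Predict for someone with 1 year and someone with 19 years of experience.
preds = model.predict([[1.0], [19.0]])
print(preds)
```

Splitting at the median keeps the two classes balanced; a different threshold would change what "High" means and how balanced the classes are.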


# Assignment 2

## Study the solvers
### Instructions
In this lesson you learned about the various solvers that pair algorithms with a machine learning process to create an accurate model. Walk through the solvers listed in the lesson and pick two. In your own words, compare and contrast these two solvers. What kind of problem do they address? How do they work with various data structures? Why would you pick one over another?

## Answer

- Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) is a batch solver commonly used for optimization problems in machine learning. It is efficient for problems with a large number of parameters but may not be suitable for large datasets due to memory constraints. It handles multinomial classification and supports only L2 regularization (or no penalty).

- Liblinear uses a coordinate descent algorithm, which works well for small datasets. It handles one-vs-rest classification and supports both L1 (Lasso) and L2 (Ridge) regularization. It specializes in linear models, making it highly efficient for datasets where a linear decision boundary is sufficient.
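
The difference in supported penalties can be seen in a small sketch. It assumes a synthetic binary dataset from `make_classification` (not a curriculum dataset) and an arbitrary `C=0.05`. It counts how many coefficients each solver drives exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem: 20 features, only 5 of them informative.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# lbfgs with its default L2 penalty: coefficients shrink but stay non-zero.
lbfgs_model = LogisticRegression(solver="lbfgs", C=0.05, max_iter=1000).fit(X, y)

# liblinear with an L1 penalty: some coefficients become exactly zero.
liblinear_model = LogisticRegression(solver="liblinear", penalty="l1", C=0.05).fit(X, y)

lbfgs_zeros = int(np.sum(lbfgs_model.coef_ == 0))
liblinear_zeros = int(np.sum(liblinear_model.coef_ == 0))
print(f"lbfgs (L2) zero coefficients: {lbfgs_zeros}")
print(f"liblinear (L1) zero coefficients: {liblinear_zeros}")
```

If a sparse model that ignores uninformative features is the goal, a solver that supports L1 (such as liblinear) is the natural pick; otherwise lbfgs is a reasonable default.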

# Assignment 3

## Parameter Play
### Instructions
- There are a lot of parameters that are set by default when working with these classifiers. Intellisense in VS Code can help you dig into them. Adopt one of the ML Classification Techniques in this lesson and retrain models tweaking various parameter values. Build a notebook explaining why some changes help the model quality while others degrade it. Be detailed in your answer.

## Solution

In [2]:
# imports for the classifiers and metrics used in this notebook
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

Parameters:
- General
  - random_state: ensures reproducibility (same split, same results each run).
  - n_jobs: number of CPU cores used; -1 means use all cores (faster training).

- RandomForestClassifier (an ensemble of decision trees)
  - n_estimators: number of trees in the forest. More trees → better accuracy (up to a point), but slower training.
  - max_depth: maximum depth of each tree.
    - Small depth: limits overfitting, but may underfit.
    - Large depth: captures more complexity, but may overfit.

- LogisticRegression
  - solver: algorithm used for optimization.
    - "liblinear" → good for small datasets and binary classification.
    - "saga" → handles large datasets; works with L1, L2, and elasticnet penalties.
    - "newton-cg", "lbfgs" → handle multinomial (multi-class) problems.
  - multi_class: how a multi-class problem is handled.
    - ovr: fits one binary classifier per class (one-vs-rest).
    - multinomial: fits a single model over all classes at once.
  - penalty: type of regularization.
    - L1: feature selection (some coefficients shrink to 0).
    - L2: prevents overfitting by penalizing large coefficients.
    - elasticnet: mix of both.
  - C: inverse of regularization strength; smaller values mean stronger regularization.
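
Besides Intellisense, the default values can be read from the estimators themselves. The sketch below uses `get_params()`, which every scikit-learn estimator provides; the selection of parameter names printed is just an example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# get_params() returns a dict of every constructor parameter and its current value.
log_reg_defaults = LogisticRegression().get_params()
forest_defaults = RandomForestClassifier().get_params()

for name in ("solver", "C", "max_iter"):
    print(f"LogisticRegression {name} = {log_reg_defaults[name]!r}")

for name in ("n_estimators", "max_depth", "random_state", "n_jobs"):
    print(f"RandomForestClassifier {name} = {forest_defaults[name]!r}")
```

Knowing the defaults makes it easier to tell whether a "tweaked" value is actually more or less restrictive than what the model used before.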



In [5]:
# import dataset
df = pd.read_csv('cleaned_cuisine.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,asparagus,avocado,bacon,baked_potato,balm,banana,barley,bartlett_pear,basil,bay,bean,beech,beef,beef_broth,beef_liver,beer,beet,bell_pepper,bergamot,berry,bitter_orange,black_bean,black_currant,black_mustard_seed_oil,black_pepper,black_raspberry,black_sesame_seed,black_tea,blackberry,...,sweet_potato,swiss_cheese,tabasco_pepper,tamarind,tangerine,tarragon,tea,tequila,thai_pepper,thyme,tomato,tomato_juice,truffle,tuna,turkey,turmeric,turnip,vanilla,veal,vegetable,vegetable_oil,vinegar,violet,walnut,wasabi,watercress,watermelon,wheat,wheat_bread,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini,cuisine
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,indian
1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,indian
2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,indian
3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,indian
4,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,indian
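
The `head()` output lists two index-like columns, `Unnamed: 0.1` and `Unnamed: 0`. Assuming both are leftover saved indexes rather than ingredients, one way to keep them out of the features is to filter on the column name. The frame below is a tiny hypothetical stand-in, not the real dataset.

```python
import pandas as pd

# Tiny stand-in frame with the same kinds of columns as the output above.
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2],
    "Unnamed: 0": [0, 1, 2],
    "almond": [0, 1, 0],
    "cuisine": ["indian", "indian", "thai"],
})

# Boolean mask marking columns whose name starts with "Unnamed".
index_like = toy.columns.str.startswith("Unnamed")

# Keep the remaining columns, then separate the target from the features.
features = toy.loc[:, ~index_like].drop(columns="cuisine")
target = toy["cuisine"]
print(list(features.columns))
```

This matters when rows are ordered by class, because a row index can then leak the label into the model.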


In [9]:
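# train a logistic regression model with default parameters

# drop the saved index column and the target; axis=1 selects columns
# note: 'Unnamed: 0.1' from the head() output is not dropped, so it stays in X
X = df.drop(['Unnamed: 0', 'cuisine'], axis=1)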
y = df['cuisine']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state = 42)
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

print(f'Accuracy:{accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))

Accuracy:0.7909887359198998
              precision    recall  f1-score   support

     chinese       0.69      0.66      0.68       151
      indian       0.93      0.87      0.90       165
    japanese       0.73      0.82      0.77       149
      korean       0.86      0.76      0.81       164
        thai       0.75      0.83      0.79       170

    accuracy                           0.79       799
   macro avg       0.79      0.79      0.79       799
weighted avg       0.80      0.79      0.79       799



In [None]:
# train a logistic regression model with explicit parameters

X = df.drop(['Unnamed: 0', 'cuisine'], axis=1)
y = df['cuisine']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# liblinear with an L1 penalty and one-vs-rest scores 0.785, slightly below the default lbfgs solver (0.791)
log_reg_p = LogisticRegression(solver='liblinear', penalty='l1', multi_class='ovr')
log_reg_p.fit(X_train, y_train)
y_pred = log_reg_p.predict(X_test)

print(f'Accuracy:{accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))

Accuracy:0.7847309136420526
              precision    recall  f1-score   support

     chinese       0.68      0.68      0.68       151
      indian       0.93      0.87      0.90       165
    japanese       0.72      0.80      0.76       149
      korean       0.84      0.74      0.79       164
        thai       0.77      0.82      0.80       170

    accuracy                           0.78       799
   macro avg       0.79      0.78      0.78       799
weighted avg       0.79      0.78      0.79       799
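
The gap between the two logistic regression runs (0.791 vs 0.785) is small enough that a single train/test split may not settle which setting is better. A hedged sketch of a fairer comparison uses 5-fold cross-validation. It runs on a synthetic binary dataset, since `cleaned_cuisine.csv` is not bundled here, and the candidate names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; swap in X and y from the notebook).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=30, n_informative=8, random_state=42
)

candidates = {
    "lbfgs + L2": LogisticRegression(solver="lbfgs", max_iter=1000),
    "liblinear + L1": LogisticRegression(solver="liblinear", penalty="l1"),
}

cv_scores = {}
for name, clf in candidates.items():
    # Five accuracy scores per candidate, one per held-out fold.
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the difference in means is smaller than the spread across folds, the two settings are effectively tied on that data.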





In [22]:
# random forest with default parameters
rand_forest = RandomForestClassifier()

rand_forest.fit(X_train, y_train)
fy_pred = rand_forest.predict(X_test)
print(f'Accuracy RFC:{accuracy_score(y_test, fy_pred)}')
print(classification_report(y_test, fy_pred))

Accuracy RFC:0.8397997496871089
              precision    recall  f1-score   support

     chinese       0.76      0.80      0.78       151
      indian       0.91      0.93      0.92       165
    japanese       0.83      0.88      0.86       149
      korean       0.90      0.73      0.80       164
        thai       0.80      0.86      0.83       170

    accuracy                           0.84       799
   macro avg       0.84      0.84      0.84       799
weighted avg       0.84      0.84      0.84       799



In [None]:
# random forest with explicit parameters
RFC_Par = RandomForestClassifier(n_estimators=50, max_depth=10)
# Fewer trees (50 instead of the default 100) and a depth cap of 10 (the default max_depth=None
# lets each tree grow fully) both limit the model, so accuracy drops from 0.84 to 0.73.
# The defaults perform better on this dataset.

RFC_Par.fit(X_train, y_train)
fy_pred_p = RFC_Par.predict(X_test)
print(f'Accuracy RFC:{accuracy_score(y_test, fy_pred_p)}')
print(classification_report(y_test, fy_pred_p))

Accuracy RFC:0.7284105131414268
              precision    recall  f1-score   support

     chinese       0.65      0.68      0.66       151
      indian       0.85      0.87      0.86       165
    japanese       0.55      0.85      0.66       149
      korean       0.85      0.64      0.73       164
        thai       0.88      0.62      0.72       170

    accuracy                           0.73       799
   macro avg       0.76      0.73      0.73       799
weighted avg       0.76      0.73      0.73       799
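
To see how `max_depth` alone affects a forest, the following sketch sweeps a few depths and records train and test accuracy. It uses a synthetic three-class dataset as a stand-in for the cuisine data, and the depth values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic multi-class data (an assumption; not the cuisine dataset).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=40, n_informative=10, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

depth_results = {}
for depth in (2, 10, None):  # None lets each tree grow without a depth limit
    forest = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    forest.fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for this depth.
    depth_results[depth] = (forest.score(X_tr, y_tr), forest.score(X_te, y_te))
    print(f"max_depth={depth}: train={depth_results[depth][0]:.3f}, test={depth_results[depth][1]:.3f}")
```

A large gap between train and test accuracy points to overfitting, while low accuracy on both points to underfitting, which is the likely cause of the drop seen with `max_depth = 10` above.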

