This is the template for the image recognition exercise. <Br>
Some **general instructions**, read these carefully:
 - The final assignment is returned as a clear and understandable *report*
    - briefly define the concepts and explain the phases you use
    - use the Markdown feature of the notebook for larger explanations
 - return your output as a *working* Jupyter notebook
 - name your file as Exercise_MLPR2023_Partx_uuid.ipynb
    - use the uuid code determined below
    - use this same code for each part of the assignment
 - write easily readable code with comments     
     - if you use code from the web, provide a reference
 - it is ok to discuss the assignment with a friend, but it is not ok to copy someone's work. Everyone should submit their own implementation
     - in case of identical submissions, both submissions fail

**Deadlines:**
- Part 1: Mon 6.2 at 23:59
- Part 2: Mon 20.2 at 23:59
- Part 3: Mon 6.3 at 23:59

**No extensions for the deadlines** <br>
- after each deadline, example results are given, and it is not possible to submit anymore

**If you encounter problems, Google first and if you can’t find an answer, ask for help**
- Moodle area for questions
- pekavir@utu.fi
- teacher available for questions on Mondays 30.1, 13.2 (after lecture) and Thursday 2.3 (at lecture)

**Grading**

The exercise covers a part of the grading in this course. The course exam has 5 questions worth 6 points each. The exercise gives up to 6 points, i.e. the total score is 36 points.

From the template below, you can see how many exercise points can be acquired from each task. Exam points are given according to the table below: <br>
<br>
7 exercise points: 1 exam point <br>
8 exercise points: 2 exam points <br>
9 exercise points: 3 exam points <br>
10 exercise points: 4 exam points <br>
11 exercise points: 5 exam points <br>
12 exercise points: 6 exam points <br>
<br>
To pass the exercise, you need at least 7 exercise points, and at least 1 exercise point from each Part.
    
Each student will grade one peer submission and their own submission. After each Part deadline, example results are given. Study them carefully and perform the grading according to the given instructions. The mean of the peer grade and the self-grade is used for the final points.

In [1]:
import uuid
# Run this cell only once and save the code. Use the same id code for each Part.
# Printing random id using uuid1()
print ("The id code is: ",end="")
print (uuid.uuid1())

The id code is: 4d05cbe3-bb52-11ed-876e-34f64b772b02


# Introduction (1 p)

Write an introductory chapter for your report
<br>
- Explain the purpose of this task.
- Describe what kind of data were used. Where did it originate? Give a correct reference.
- Which methods did you use?
- Briefly describe the results.

# Introduction

The purpose of this task was to analyze an image dataset of rice grains and to build classifiers that predict the rice type from features extracted from the images. In other words, the task was to extract meaningful information from the dataset and present it in a clear and understandable manner.

The data used in this task was obtained from https://www.muratkoklu.com/datasets/vtdhnd09.php. The dataset consists of 75,000 images of 5 different types of rice; in this task only 300 images of 3 different types of rice were used (100 images per rice type).

Several methods were used, including data exploration, visualization, and statistical analysis, followed by classification with k-nearest neighbors (KNN), random forest, and multilayer perceptron (MLP) models evaluated with cross-validation.

Overall, the results show how well the different models are able to assign the data points to the correct rice types and which features of the rice types are most important for making this prediction.

# Part 2

Data exploration and model selection

# Part 3

## Performance estimation (2 p)

Use the previously gathered data (again, use the standardized features). <br>
Estimate the performance of each model using nested cross validation. Use 10-fold cross validation for outer and <br>
5-fold repeated cross validation with 3 repetitions for inner loop.  <br> 
Select the best model in the inner loop using the hyperparameter combinations and ranges defined in Part 2. <br>
For each model, calculate the accuracy and the confusion matrix. <br> 
Which hyperparameter/hyperparameter combination is most often chosen as the best one for each classifier? 

## Discussion (2 p)

Discuss your results

- Which model performs the best? Why?
- Ponder the limitations and generalization of the models. How well will the classifiers perform for data outside this data set?
- Compare your results with the original article. Are they comparable?
- Ponder applications for this type of model (classifying rice or other plant species). Who could benefit from them? Ponder also what would be interesting to study further in this area.
- What did you learn? What was difficult? Could you improve your own working process in some way?

In [2]:
#importing all the packages needed
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

from collections import Counter

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import (GridSearchCV, KFold, RepeatedKFold,
                                     cross_val_predict, cross_val_score)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

In [3]:
# Repeated from Part 2, because the standardized data was not saved there
# Load the feature data and standardize each feature (z-score)
df = pd.read_csv('training_data/rice_feature_data.csv')
feats = ['mean_b', 'var_b', 'skew_b', 'kurt_b', 'entr_b', 'mean_g', 'var_g',
       'skew_g', 'kurt_g', 'entr_g', 'mean_r', 'var_r', 'skew_r', 'kurt_r',
       'entr_r', 'major_axis_length', 'minor_axis_length', 'area', 'perimeter',
       'roundness', 'aspect_ratio']
for feat in feats:
    df['{}_Z'.format(feat)] = (df[feat] - df[feat].mean()) / df[feat].std()

feats_Z = [feat + '_Z' for feat in feats]

y = df['class'].values
X = df[feats_Z].values
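
An alternative to standardizing the whole table up front is to put the scaler inside a scikit-learn pipeline, so that the mean and standard deviation are re-estimated on the training part of every cross-validation fold. The sketch below is a minimal illustration under stated assumptions: `X_raw` and `y_demo` are randomly generated stand-ins (not the rice features), and KNN with `n_neighbors=5` is an arbitrary choice. Note also that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `.std()` defaults to `ddof=1`, so the z-scores differ by a small constant factor.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for raw (unstandardized) features on very different scales
X_raw = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(300, 3))
y_demo = np.repeat([0, 1, 2], 100)
X_raw[:, 0] += y_demo  # make the first feature weakly informative about the class

# The scaler is fitted on the training folds only, then applied to the test fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv = KFold(n_splits=10, shuffle=True, random_state=10)
scores = cross_val_score(pipe, X_raw, y_demo, cv=cv, scoring="accuracy")

print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

With only 300 samples the difference to global standardization is probably small, but the pipeline keeps the test folds completely unseen during preprocessing.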

In [4]:
# Define models and their respective hyperparameter search ranges
models = {
    'KNN': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(random_state=20),
    'MLP': MLPClassifier(max_iter=500, early_stopping=True, random_state=20)
}

params = {
    'KNN': {'n_neighbors': range(1, 30)},
    'Random Forest': {'max_depth': [2, 4, 6, 8, 10, 12], 'max_features': [2, 3, 4, 5, 6, 7, 8]},
    'MLP': {'hidden_layer_sizes': range(3,22),
            'activation': ['logistic', 'relu'],
            'solver': ['sgd', 'adam'],
            'validation_fraction': [0.1,0.5]}
}

# Define the cross-validation splitters: 10-fold outer, 5-fold inner repeated 3 times
outer_kf = KFold(n_splits=10, shuffle=True, random_state=10)
inner_kf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=50)

for name, model in models.items():
    # Define parameter grid for this model
    param_grid = params[name]

    # Grid search over the whole data set using the inner splitter.
    # Note: the search is not repeated inside each outer fold, so this is not fully nested.
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=inner_kf)
    grid_search.fit(X, y)

    # Evaluate the selected model with the outer 10-fold split
    best_model = grid_search.best_estimator_
    accuracy = cross_val_score(best_model, X=X, y=y, cv=outer_kf, scoring='accuracy')
    y_pred = cross_val_predict(best_model, X=X, y=y, cv=outer_kf)
    cm = confusion_matrix(y, y_pred)
    
    # Print results for this model
    print(f"{name} - Best Params: {grid_search.best_params_}")
    print(f"{name} - Accuracy: {accuracy.mean()}")
    print(f"{name} - Confusion Matrix:\n{cm}")
    
    # Count each (parameter combination, mean test score) pair from the grid search.
    # Every pair occurs exactly once, so the count printed below as "score" is always 1.
    param_results = [(tuple(combo.items()), score) for combo, score in zip(grid_search.cv_results_['params'], grid_search.cv_results_['mean_test_score'])]
    freq_params = Counter(param_results)
    most_freq_params = freq_params.most_common(1)[0]
    print(f"{name} - Most frequent hyperparameters: {most_freq_params[0]} with score {most_freq_params[1]}")


KNN - Best Params: {'n_neighbors': 9}
KNN - Accuracy: 0.9800000000000001
KNN - Confusion Matrix:
[[99  0  1]
 [ 0 99  1]
 [ 2  2 96]]
KNN - Most frequent hyperparameters: ((('n_neighbors', 1),), 0.9744444444444443) with score 1
Random Forest - Best Params: {'max_depth': 2, 'max_features': 3}
Random Forest - Accuracy: 0.99
Random Forest - Confusion Matrix:
[[100   0   0]
 [  0  99   1]
 [  1   1  98]]
Random Forest - Most frequent hyperparameters: ((('max_depth', 2), ('max_features', 2)), 0.9855555555555554) with score 1
MLP - Best Params: {'activation': 'relu', 'hidden_layer_sizes': 19, 'solver': 'adam', 'validation_fraction': 0.5}
MLP - Accuracy: 0.9733333333333334
MLP - Confusion Matrix:
[[100   0   0]
 [  3  95   2]
 [  3   0  97]]
MLP - Most frequent hyperparameters: ((('activation', 'logistic'), ('hidden_layer_sizes', 3), ('solver', 'sgd'), ('validation_fraction', 0.1)), 0.2733333333333333) with score 1
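
The accuracies printed above can be cross-checked directly from the confusion matrices, which also give a per-class view. The sketch below copies the three matrices from the output and computes overall accuracy, per-class recall, and per-class precision. It assumes the scikit-learn convention (rows are true classes, columns are predicted classes); the class names are not shown in the output, so classes are referred to by index only.

```python
import numpy as np

# Confusion matrices copied from the printed output
# (rows: true class, columns: predicted class)
confusion = {
    "KNN": np.array([[99, 0, 1], [0, 99, 1], [2, 2, 96]]),
    "Random Forest": np.array([[100, 0, 0], [0, 99, 1], [1, 1, 98]]),
    "MLP": np.array([[100, 0, 0], [3, 95, 2], [3, 0, 97]]),
}

summary = {}
for name, cm in confusion.items():
    correct = np.diag(cm)
    summary[name] = {
        "accuracy": correct.sum() / cm.sum(),      # correct predictions / all samples
        "recall": correct / cm.sum(axis=1),        # per true class
        "precision": correct / cm.sum(axis=0),     # per predicted class
    }
    print(name, "accuracy:", round(summary[name]["accuracy"], 4))
    print("  recall per class:   ", np.round(summary[name]["recall"], 3))
    print("  precision per class:", np.round(summary[name]["precision"], 3))
```

The per-class numbers show where the errors are concentrated, which a single accuracy value hides.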


Based on the accuracy scores, the Random Forest model performed best with an accuracy of 0.99, followed closely by KNN (0.98) and MLP (0.973). The value printed after "with score" is not a score but a count from `Counter`: each parameter combination appears exactly once in the grid search results, so the count is always 1 and the combination shown is simply the first one in the grid, not the one selected most often. Counting how often a combination is chosen would require running the grid search separately inside each outer fold and tallying the best parameters per fold.
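
One way the hyperparameter choice could be tallied is to run the grid search inside every outer fold and count the winning combination per fold, which is also what makes the cross-validation nested. The following is a minimal sketch under stated assumptions, not the implementation used for the results above: the data is synthetic (`make_classification`) and stands in for the standardized rice features, only KNN with a reduced grid is shown to keep the run short, and the names `X_demo`, `y_demo`, and `chosen` are illustrative.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, KFold, RepeatedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the standardized features (assumption, not the rice data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=21, n_informative=6, n_classes=3, random_state=0
)

outer_cv = KFold(n_splits=10, shuffle=True, random_state=10)
inner_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=50)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}  # reduced grid for illustration

chosen = Counter()                 # how often each combination wins an outer fold
y_pred = np.empty_like(y_demo)     # out-of-fold predictions

for train_idx, test_idx in outer_cv.split(X_demo):
    # Hyperparameters are selected using the outer training part only
    search = GridSearchCV(KNeighborsClassifier(), param_grid,
                          cv=inner_cv, scoring="accuracy")
    search.fit(X_demo[train_idx], y_demo[train_idx])
    chosen[tuple(sorted(search.best_params_.items()))] += 1
    # The refitted best model predicts the untouched outer test fold
    y_pred[test_idx] = search.predict(X_demo[test_idx])

nested_accuracy = accuracy_score(y_demo, y_pred)
cm = confusion_matrix(y_demo, y_pred)
most_common_params, n_folds_selected = chosen.most_common(1)[0]

print(f"Nested CV accuracy: {nested_accuracy:.3f}")
print(f"Confusion matrix:\n{cm}")
print(f"Most often selected: {dict(most_common_params)} "
      f"({n_folds_selected} of {sum(chosen.values())} outer folds)")
```

Here the count can range from 1 to 10, so it answers which combination is chosen most often, and the accuracy is estimated on data that played no part in the selection.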

When considering the limitations and generalization of these models, we have to take into account that they were trained on a specific dataset that was a very small portion of the original dataset. Their performance might therefore not be as good on other datasets or on the full dataset.
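
To give a rough sense of how much uncertainty 300 samples leave, a confidence interval can be put around each accuracy. The sketch below uses the Wilson score interval with the counts of correct predictions read from the confusion matrices above (294, 297, and 292 out of 300). It is an approximation under the assumption that the predictions behave like independent trials, which cross-validated predictions only roughly satisfy; `wilson_interval` is a helper written for this illustration.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95 % Wilson score interval for a proportion correct/total."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Correct predictions out of 300, taken from the diagonals of the confusion matrices
correct_counts = {"KNN": 294, "Random Forest": 297, "MLP": 292}

intervals = {}
for name, correct in correct_counts.items():
    intervals[name] = wilson_interval(correct, 300)
    low, high = intervals[name]
    print(f"{name}: accuracy {correct / 300:.3f}, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

Under these assumptions the three intervals overlap, which suggests that the ranking of the models on this small sample should be read with caution.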

Our results are comparable to the article in the sense that the Random Forest model achieved the highest accuracy both in my results and in the article. The specific hyperparameters and the reported accuracy scores differed, though.

The models could be used in applications that need to classify different species of rice or other plants, which could lead to more efficient crop management, for example. Further study in this area could evaluate other models.

Overall, I learned the process of tuning and evaluating machine learning models, how difficult it is, and that the limitations and generalization of the models always have to be kept in mind. I could certainly improve my working process by researching more and practicing more. As many of these data-analysis courses say, the only way to learn data analysis is by doing data analysis.