# Data Mining – Classification
## Gabriel Marcelino, Eli Kaustinen

Write a comprehensive technical report as a Markdown document that includes all code, code comments, outputs, plots, and analysis. The project documentation should contain (a) a problem statement, (b) the algorithm of the solution, (c) an analysis of the findings, and (d) references.

## Part I: Data Mining Techniques

Explain each of the following data mining techniques in terms of how the algorithm works, its strengths, and weaknesses:

### Classification
Classification algorithms assign each data point to one of a set of predefined labels or categories (for example, spam or not spam). They learn patterns from labeled training data and apply those patterns to classify new, unseen data. Common methods include decision trees, support vector machines (SVMs), and neural networks.
- Strengths: Can reach high accuracy when the training data is well labeled; widely used for spam detection, medical diagnosis, and sentiment analysis.
- Weaknesses: Performance drops with imbalanced or noisy data, and some models (e.g., deep learning) require significant computational resources.
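
To make the imbalanced-data weakness concrete, the sketch below uses made-up labels (95 "not spam", 5 "spam") and scikit-learn's `DummyClassifier`, which always predicts the majority class. The data and variable names are illustrative assumptions, not part of the report's data set. The point is that accuracy alone can look strong while the minority class is never detected.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, imbalanced labels (assumption): 95 "not spam" (0) and 5 "spam" (1).
y_true = np.array([0] * 95 + [1] * 5)
# A majority-class baseline ignores the features, so a placeholder column is enough.
X_placeholder = np.zeros((len(y_true), 1))

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true)
y_pred = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_true, y_pred)
spam_recall = recall_score(y_true, y_pred, pos_label=1)

print(f"Accuracy: {baseline_accuracy:.2f}")
print(f"Recall on the spam class: {spam_recall:.2f}")
```

Here the baseline is right on every "not spam" message and wrong on every "spam" message, so per-class metrics such as recall are worth reporting alongside accuracy when classes are imbalanced.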

### Prediction
Prediction models forecast future values based on historical data using regression techniques, time series analysis, or machine learning models.
- Strengths: Useful for financial forecasting, sales predictions, and demand planning; can handle complex patterns.
- Weaknesses: Accuracy depends on data quality and completeness, and models struggle to account for unpredictable external factors.
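
As a minimal illustration of regression-based prediction, the sketch below fits a straight-line trend to twelve months of hypothetical sales figures and forecasts month 13. The sales numbers are generated synthetically (a trend of 5 units per month plus small noise), so every value here is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales (assumption): linear trend plus small random noise.
rng = np.random.default_rng(0)
months = np.arange(1, 13).reshape(-1, 1)          # months 1..12 as a column
sales = 100 + 5 * months.ravel() + rng.normal(0, 2, size=12)

# Fit a linear trend to the historical data.
trend_model = LinearRegression().fit(months, sales)
slope = trend_model.coef_[0]

# Forecast the next, unseen month.
forecast = trend_model.predict(np.array([[13]]))[0]

print(f"Estimated growth per month: {slope:.2f}")
print(f"Forecast for month 13: {forecast:.2f}")
```

A model like this only extrapolates the pattern it has seen; a sudden external shock in month 13 would not be reflected in the forecast, which is the weakness noted above.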

## Example of each data mining functionality using a real-life data set


```python
import csv

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# --- Classification: spam / not spam ---

# Read emails.csv; the first row holds the column names
with open('emails.csv', 'r') as file:
    reader = csv.reader(file)
    data = list(reader)

# Convert to a DataFrame and drop rows with missing values
df = pd.DataFrame(data[1:], columns=data[0])
df = df.dropna()
X = df['Message']
y = df['Label']

# Convert text to bag-of-words count features.
# The sparse matrix is used directly: MultinomialNB accepts sparse input,
# so a dense copy via .toarray() is not needed.
vectorizer = CountVectorizer()
X_transformed = vectorizer.fit_transform(X)

# Split into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.3, random_state=42)

# Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Evaluate the model: score() returns the mean accuracy on the test set
accuracy = model.score(X_test, y_test)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# --- Prediction: house price prediction ---

# Load the California housing dataset (target: median house value in $100,000s)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict the last test sample and compare with its actual value
print(housing.feature_names)
print("Features for last house:", X_test[-1])
print("Prediction for last house(in $100,000s):", reg.predict([X_test[-1]]))
print("Actual price for last house (in $100,000s):", y_test[-1])
```


```
Model Accuracy: 98.84%
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
Features for last house: [ 3.55210000e+00  1.70000000e+01  3.98883929e+00  1.03348214e+00
  1.67100000e+03  3.72991071e+00  3.42200000e+01 -1.18370000e+02]
Prediction for last house(in $100,000s): [2.00940251]
Actual price for last house (in $100,000s): 1.515
```
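
The regression output above compares a single test sample, which says little about overall model quality. One possible way to summarize performance over the whole test set is sketched below. The helper `evaluate_regression` is a hypothetical name introduced here, and the data comes from `make_regression` (synthetic) so the block runs without downloading anything; the same helper could be called with the housing model and its test split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_regression(model, X_test, y_test):
    """Return (RMSE, R^2) of a fitted regressor on a held-out test set."""
    y_pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    r2 = float(r2_score(y_test, y_pred))
    return rmse, r2


# Synthetic stand-in data (assumption): 500 samples, 8 features, Gaussian noise.
X_syn, y_syn = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

syn_model = LinearRegression().fit(X_tr, y_tr)
rmse, r2 = evaluate_regression(syn_model, X_te, y_te)

print(f"Test RMSE: {rmse:.3f}")
print(f"Test R^2:  {r2:.3f}")
```

RMSE is expressed in the units of the target (for the housing data, $100,000s), while R² reports the share of target variance the model explains, so the two are usually read together.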


## Part II
Access the "UCI Machine Learning Repository," located in the topic Resources. Note: There are about 440 data sets that are suitable for use in a classification task. For this part of the exercise, you can choose one of these data sets, provided it includes at least 10 attributes and 10,000 instances. 
In class, we briefly discussed three classification methods: k-Nearest Neighbours (kNN), Support Vector Machine (SVM), and Decision Trees. For your selected data set, choose any two of the three classification methods and build a classifier based on each, as follows:

- Pre-process the data.
- Subset the data.
- Split the data into training and testing sets.
- Build the classification model.
- Run the model (make predictions).
- Display classification results (quantitative and visual).
- Provide the confusion matrix for each classifier.
- For each classifier, compute the accuracy, sensitivity, and specificity.
- Explain the use of the ROC curve and the meaning of the area under the ROC curve.
- Compare the results obtained with each one of the classifiers, referring to the confusion matrix and associated metrics. Are the results similar? If not, how statistically different are they? If different, what is the reason? If similar, how would you decide to use one method or another?
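
The evaluation steps listed above can be sketched as follows. This is not a solution for a specific UCI data set: it uses synthetic data from `make_classification`, and the helper name `binary_metrics`, the choice of kNN with k=5, and the tree depth are all assumptions made for illustration. Sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The ROC curve plots the true positive rate against the false positive rate as the decision threshold varies, and the area under it (AUC) can be read as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def binary_metrics(clf, X_test, y_test):
    """Confusion matrix, accuracy, sensitivity, specificity and AUC for a fitted binary classifier."""
    y_pred = clf.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    # Probability of the positive class, used to rank samples for the ROC curve.
    scores = clf.predict_proba(X_test)[:, 1]
    return {
        "confusion_matrix": [[int(tn), int(fp)], [int(fn), int(tp)]],
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "auc": roc_auc_score(y_test, scores),
    }


# Synthetic stand-in for a real data set (assumption): 2,000 samples, 10 attributes.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=42
)
# Stratify so both classes appear in the test set in the same proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

classifiers = {
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    results[name] = binary_metrics(clf, X_test, y_test)
    print(name, results[name])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so comparing the two classifiers on AUC together with the confusion-matrix metrics gives a threshold-independent and a threshold-dependent view of the same models.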

## References

- GeeksforGeeks. Data Mining Techniques. https://www.geeksforgeeks.org/data-mining-techniques/

