# Data Mining – Classification Part 1
**Authors**
### Eli Kaustinen and Gabriel Marcelino



## Part I: Data Mining Techniques

Explain each of the following data mining techniques in terms of how the algorithm works, its strengths, and weaknesses:

### Classification: 
Classification algorithms categorize data into predefined labels or categories (like spam or not spam). They learn patterns from labeled training data and use that knowledge to classify new data. Common methods include decision trees, support vector machines (SVM), and neural networks. 
- Strengths: High accuracy with well-labeled data, useful for spam detection, medical diagnosis, and sentiment analysis.
- Weaknesses: Performance drops with imbalanced or noisy data, and some models (e.g., deep learning) require significant computational resources.

### Prediction
Prediction models forecast future values based on historical data using regression techniques, time series analysis, or machine learning models.
- Strengths: Useful for financial forecasting, sales predictions, and demand planning; can handle complex patterns.
- Weaknesses: Accuracy depends on data quality and completeness, and it struggles with unpredictable external factors.

## Example of each data mining functionality using a real-life data set


In [4]:
import pandas as pd
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

# Part 1: Email Spam Classification
with open('emails.csv', 'r') as file:
    reader = csv.reader(file)
    data = list(reader)

df = pd.DataFrame(data[1:], columns=data[0])
df = df.dropna()
X = df['Message']
y = df['Label']

vectorizer = CountVectorizer()
X_transformed = vectorizer.fit_transform(X).toarray()

X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.3, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print("\n=== Spam Classification Results ===")
print(f"Model Accuracy: {accuracy * 100:.2f}%\n")

# Example prediction
example_email = X_test[0]
predicted_label = model.predict([example_email])[0]
print("Email content:", X.iloc[0])
print(f"Example email prediction: {predicted_label}")
print(f"Actual label: {y_test.iloc[0]}")

# Part 2: House Price Prediction
data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print("=== House Price Prediction Results ===")
print("Feature names:", ", ".join(data.feature_names))
print("\nLast house features:")
for name, value in zip(data.feature_names, X_test[-1]):
    print(f"{name}: {value:.4f}")
print(f"\nPredicted price: ${model.predict([X_test[-1]])[0]*100000:.2f}")
print(f"Actual price: ${y_test[-1]*100000:.2f}")


=== Spam Classification Results ===
Model Accuracy: 98.84%

Email content: Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you 

## Convlusion

Classification excels at categorizing data into discrete classes through methods like Naive Bayes, making it ideal for tasks like spam detection where clear categories exist. While powerful, classification can struggle with ambiguous cases and requires high-quality labeled training data. Prediction, as demonstrated through linear regression in housing price estimation, specializes in forecasting continuous numerical values by identifying underlying patterns and relationships in data. However, prediction models can be sensitive to outliers and may oversimplify complex relationships.

## References

https://www.geeksforgeeks.org/data-mining-techniques/

