# NLP Project - Social Media Text Classification

## Project Overview
This project classifies social media posts as 'Relevant' or 'Not Relevant' using Natural Language Processing (NLP) and Machine Learning techniques: the text is cleaned, vectorized into unigram and bigram counts, and passed to a random forest classifier.


## Data Loading and Preparation

In [83]:
import pandas as pd
import string
import nltk
import re  

from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer


from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Download the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [84]:
df = pd.read_csv('/kaggle/input/nlp-starter-test/socialmedia_relevant_cols.csv',encoding='latin1')

In [85]:
df.head(8)

Unnamed: 0,text,choose_one,class_label
0,Just happened a terrible car crash,Relevant,1
1,Our Deeds are the Reason of this #earthquake M...,Relevant,1
2,"Heard about #earthquake is different cities, s...",Relevant,1
3,"there is a forest fire at spot pond, geese are...",Relevant,1
4,Forest fire near La Ronge Sask. Canada,Relevant,1
5,All residents asked to 'shelter in place' are ...,Relevant,1
6,"13,000 people receive #wildfires evacuation or...",Relevant,1
7,Just got sent this photo from Ruby #Alaska as ...,Relevant,1


In [86]:
df = df.drop("choose_one",axis=1) 

## Data Cleaning and Preprocessing

In [87]:
# Convert the text to lowercase
df['text'] = df['text'].str.lower()

# Remove URLs
df['text'] = df['text'].str.replace(r'https?://\S+|www\.\S+', '', regex=True)

# Remove HTML tags
df['text'] = df['text'].str.replace(r'<.*?>', '', regex=True)

# Remove punctuation and digits (everything except letters and whitespace)
df['text'] = df['text'].str.replace(r'[^a-zA-Z\s]', '', regex=True)

# Collapse repeated whitespace and trim the ends
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True).str.strip()
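
To see what these steps do to a single post, here is a minimal sketch that applies the same regular expressions with the standard `re` module instead of pandas. The `clean_text` helper and the sample string are hypothetical: the sample is adapted from row 6 of the preview, with an invented URL and HTML tag added for illustration.

```python
import re

def clean_text(text: str) -> str:
    """Apply the same cleaning steps as the pandas cell to one string."""
    text = text.lower()                                # lowercase
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'<.*?>', '', text)                  # remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)            # drop punctuation and digits
    text = re.sub(r'\s+', ' ', text).strip()           # collapse whitespace
    return text

# Hypothetical sample, not taken verbatim from the dataset
sample = "13,000 people receive #wildfires evacuation <b>orders</b> http://t.co/abc"
print(clean_text(sample))  # people receive wildfires evacuation orders
```

Note that the number `13,000` disappears entirely and the hashtag keeps its word but loses the `#`, so any signal carried by digits or hashtag markers is not available to the model.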

In [88]:
x = df['text']
y = df['class_label']

In [89]:
x[60]

'psa i m splitting my personalities techies follow ablazeco burners follow ablaze'

In [90]:
x[99]

'accident cleared in paturnpike on patp eb between pa and cranberry slow back to traffic'

In [91]:
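# English stop word list from NLTK (not used below; the vectorizer applies scikit-learn's built-in 'english' list)
stop_words = set(stopwords.words('english'))

# Vectorize into unigram and bigram counts, removing stop words
vect = CountVectorizer(ngram_range=(1,2),stop_words='english')

x = vect.fit_transform(x)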

In [92]:
x

<10876x80465 sparse matrix of type '<class 'numpy.int64'>'
	with 171400 stored elements in Compressed Sparse Row format>
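
The 80,465 columns come from counting both single words and word pairs. The following is a simplified, standard-library sketch of that idea, not scikit-learn's implementation: the `unigrams_and_bigrams` helper and the tiny `TOY_STOP_WORDS` set are hypothetical stand-ins for `CountVectorizer` and its much longer built-in English list.

```python
import re
from collections import Counter

# Hypothetical, tiny stop list for illustration only
TOY_STOP_WORDS = {"a", "at", "is", "the", "there", "are"}

def unigrams_and_bigrams(text):
    # Keep tokens of two or more word characters, mirroring
    # CountVectorizer's default token_pattern r"(?u)\b\w\w+\b".
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Stop words are dropped before n-grams are built, so a bigram can
    # join two words that were not adjacent in the original text.
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

counts = unigrams_and_bigrams("there is a forest fire at spot pond")
print(sorted(counts))
# ['fire', 'fire spot', 'forest', 'forest fire', 'pond', 'spot', 'spot pond']
```

Four remaining words already yield seven features, which is why the vocabulary grows so quickly with `ngram_range=(1,2)`. The `fire spot` bigram also shows that pairs can span a removed stop word.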

## Model Training and Evaluation

In [93]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Initializing the RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Training the classifier
rf_classifier.fit(X_train, y_train)

# Predicting on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_results = classification_report(y_test, y_pred)

print("Accuracy: ",accuracy)
print("Classification Report: ",classification_report_results)

Accuracy:  0.8001838798651547
Classification Report:                precision    recall  f1-score   support

           0       0.78      0.90      0.84      1857
           1       0.84      0.67      0.74      1401
           2       0.00      0.00      0.00         5

    accuracy                           0.80      3263
   macro avg       0.54      0.52      0.53      3263
weighted avg       0.81      0.80      0.80      3263
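
The report lists a third label, `2`, with only 5 test rows, although the overview describes two categories. Because the macro average weights every class equally, that one label pulls it far below the weighted average. The arithmetic can be checked with a short sketch that uses the rounded per-class values printed above, so the results are approximate; `macro_average` is a hypothetical helper, not a scikit-learn function.

```python
# Per-class (precision, recall, f1) as printed in the report,
# already rounded to two decimals.
per_class = {
    0: (0.78, 0.90, 0.84),
    1: (0.84, 0.67, 0.74),
    2: (0.00, 0.00, 0.00),
}

def macro_average(scores):
    """Unweighted mean of each metric across the given classes."""
    n = len(scores)
    return tuple(round(sum(col) / n, 2) for col in zip(*scores.values()))

# All three labels: reproduces the 'macro avg' row of the report
print(macro_average(per_class))  # (0.54, 0.52, 0.53)

# Labels 0 and 1 only: an arithmetic illustration, not a retrained model
print(macro_average({k: per_class[k] for k in (0, 1)}))
```

If label `2` is not meant to be a target class, one option is to drop those rows before vectorizing. The scores would then need to be recomputed from a retrained model rather than inferred from this report.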

