**Install Dependencies**

In [1]:
!pip install kaggle



**Import Kaggle JSON File**

In [8]:
# Place the Kaggle API token (kaggle.json) where the kaggle CLI expects it
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
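Before running the download, it can help to confirm that the token file is in place and readable only by its owner. The following is a minimal sketch using only the standard library; the helper name `check_kaggle_credentials` is hypothetical, and the check assumes a POSIX file system (as on Colab) and a token containing the `username` and `key` fields.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    # Any group/other permission bit means other users could read the token.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        problems.append("file is not valid JSON")
        return problems
    for field in ("username", "key"):
        if field not in data:
            problems.append(f"missing field: {field}")
    return problems


# An empty list means the file looks usable.
print(check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json"))
```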


In [2]:
!pip install kagglehub



In [9]:
!kaggle datasets download -d kazanova/sentiment140

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to /content
  0% 0.00/80.9M [00:00<?, ?B/s]
100% 80.9M/80.9M [00:00<00:00, 1.21GB/s]


In [10]:
from zipfile import ZipFile
dataset = '/content/sentiment140.zip'

# Use a name other than `zip` so the built-in zip() is not shadowed
with ZipFile(dataset, 'r') as archive:
  archive.extractall()
  print('The dataset is extracted')


The dataset is extracted
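To see which files an archive contains, and how large they are once uncompressed, the members can be listed without extracting anything. This is a small sketch; the helper name `list_archive` is an assumption introduced here for illustration.

```python
from zipfile import ZipFile


def list_archive(zip_path):
    """Return (file name, uncompressed size in bytes) for each archive member."""
    with ZipFile(zip_path, "r") as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# In this notebook it could be called as:
# list_archive('/content/sentiment140.zip')
```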


In [11]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [12]:
import nltk
nltk.download('stopwords')
remove_unwanted_words=stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
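# Note: stemming_function keeps letters only, so a token such as '//' never
# reaches the stopword check; only alphabetic entries (e.g. 'http') have an effect.
remove_unwanted_words.append('//')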

In [13]:
column_names = ['target','id','date','flag','user','tweet_text']
raw_twitter_data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', names=column_names, encoding='latin-1')

In [14]:
raw_twitter_data.shape

(1600000, 6)

In [15]:
raw_twitter_data.head()

,target,id,date,flag,user,tweet_text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [16]:
raw_twitter_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   target      1600000 non-null  int64 
 1   id          1600000 non-null  int64 
 2   date        1600000 non-null  object
 3   flag        1600000 non-null  object
 4   user        1600000 non-null  object
 5   tweet_text  1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [21]:
raw_twitter_data.isnull().sum()

target        0
id            0
date          0
flag          0
user          0
tweet_text    0


In [20]:
raw_twitter_data['target'].value_counts()

target,count
0,800000
4,800000


In [22]:
raw_twitter_data['date'].value_counts().max()

20

In [23]:
raw_twitter_data.loc[raw_twitter_data['target'] == 4, 'target'] = 1

In [24]:
raw_twitter_data['target'].value_counts()

target,count
0,800000
1,800000


In [25]:
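port_stem = PorterStemmer()
# Set membership is O(1); a list would be scanned for every token of 1.6M tweets
stop_words = set(remove_unwanted_words)

def stemming_function(content):
    # Replace everything except letters with spaces, then lowercase and tokenize
    stemmed_content = re.sub('[^a-zA-Z]', ' ', str(content))
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # Drop stopwords and reduce the remaining words to their Porter stems
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content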
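Because the regex replaces every non-letter with a space, a URL such as `http://twitpic.com/2y1zl` is broken into fragments like `http`, `twitpic`, and `com`, and user handles survive as ordinary words. One possible refinement, sketched here as an assumption rather than as part of the original pipeline, is to strip URLs and @mentions before the letter filter. The pattern and function names are hypothetical, and stopword removal and stemming would still follow this step.

```python
import re

# Illustrative patterns: http(s) links, bare www links, and @handles.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def strip_urls_and_mentions(text):
    """Remove URLs and @handles, keep letters only, lowercase, collapse spaces."""
    text = URL_PATTERN.sub(" ", str(text))
    text = MENTION_PATTERN.sub(" ", text)
    text = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(text.lower().split())


print(strip_urls_and_mentions(
    "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
))
# prints: awww that s a bummer
```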

In [26]:
raw_twitter_data['tweet_text'] = raw_twitter_data['tweet_text'].apply(stemming_function)

In [27]:
raw_twitter_data.head(5)

,target,id,date,flag,user,tweet_text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,nationwideclass behav mad see


In [28]:
X=raw_twitter_data['tweet_text'].values

In [29]:
Y=raw_twitter_data['target'].values

In [None]:
X

array(['switchfoot twitpic com zl awww bummer shoulda got david carr third day',
       'upset updat facebook text might cri result school today also blah',
       'kenichan dive mani time ball manag save rest go bound', ...,
       'readi mojo makeov ask detail',
       'happi th birthday boo alll time tupac amaru shakur',
       'happi charitytuesday thenspcc sparkschar speakinguph h'],
      dtype=object)

In [None]:
Y

array([0, 0, 0, ..., 1, 1, 1])

With the tweets cleaned and stemmed, the data is ready for text classification once it is vectorized. Common model choices:

*   **Logistic Regression**: A linear model for binary classification. It trains quickly on sparse TF-IDF features and is a strong baseline for text.
*   **Naive Bayes**: A simple, fast probabilistic model. The multinomial variant is designed for word counts but also works with TF-IDF features in practice.
*   **Support Vector Machines (SVM)**: Linear SVMs suit high-dimensional sparse text features. Kernel SVMs (such as RBF) can capture non-linear relationships but scale poorly to datasets of this size.
*   **Decision Trees/Random Forests**: Tree-based models that capture non-linear patterns. A Random Forest, an ensemble of decision trees, usually outperforms a single tree.
*   **Gradient Boosting Machines (Gradient Boosting Classifier, XGBoost, LightGBM, CatBoost)**: Ensemble methods that build trees iteratively, each one correcting the errors of the previous ones. They can reach high accuracy.
*   **Neural Networks (feedforward networks, CNNs, RNNs/LSTMs, Transformers)**: A simple feedforward network can take TF-IDF features as input, while CNNs, RNNs/LSTMs, and Transformers work on token sequences or word embeddings.

To choose among them, train a few and compare accuracy, precision, recall, and F1-score on the held-out test set. This notebook trains Logistic Regression and Naive Bayes.
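The metrics named above all derive from the four confusion-matrix counts. As a worked illustration, here is a dependency-free sketch for binary labels; the function name `binary_metrics` and the toy labels are made up for this example. In practice `sklearn.metrics` offers `precision_score`, `recall_score`, `f1_score`, and `classification_report` for the same purpose.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
# every metric equals 0.75 for these toy labels
```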

In [30]:
# Split data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, stratify=Y, random_state=2)

In [31]:
# Vectorize text data
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
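The cell above fits the vectorizer on the training split only and reuses it on the test split, so no test vocabulary leaks into training. A scikit-learn pipeline can bundle the vectorizer and the classifier so that this rule is applied automatically. The sketch below uses a made-up four-document corpus (the texts and labels are illustrative, not taken from Sentiment140) and mirrors the `max_features` and `max_iter` settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus of already-stemmed text (1 = positive, 0 = negative).
train_texts = ["love sunni day", "great fun today", "hate rain", "feel sad tire"]
train_labels = [1, 1, 0, 0]

# fit() learns the TF-IDF vocabulary and the classifier from the training texts;
# predict() only transforms new text with that already-fitted vocabulary.
text_clf = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=5000),
)
text_clf.fit(train_texts, train_labels)

prediction = text_clf.predict(["love fun day"])
print(prediction)
```

A pipeline object can also be pickled as a single unit, which keeps the fitted vocabulary and the model together.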

In [None]:
print(Y_train)

[1 1 1 ... 1 1 0]


In [32]:
# Train Logistic Regression model
model = LogisticRegression(max_iter=5000)
model.fit(X_train, Y_train)

In [33]:
# Make predictions and evaluate the model
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy score on training data : ', training_data_accuracy)

X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy score on test data : ', test_data_accuracy)

Accuracy score on training data :  0.7704678571428571
Accuracy score on test data :  0.7678708333333333


In [34]:
import pickle

filename = 'twitter_data_logistic_regression_trained_model.pkl'
# A context manager closes the file even if pickling fails
with open(filename, 'wb') as f:
    pickle.dump(model, f)
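A saved classifier can only score new tweets if the fitted `TfidfVectorizer` is available as well, since the model expects the same feature columns it was trained on. One way to keep the two together is to pickle them in a single file. The helpers `save_bundle` and `load_bundle` and the file name in the comments are hypothetical and shown only as a sketch.

```python
import pickle


def save_bundle(path, **objects):
    """Pickle several named objects (e.g. vectorizer and model) into one file."""
    with open(path, "wb") as f:
        pickle.dump(objects, f)


def load_bundle(path):
    """Load the dict of named objects written by save_bundle."""
    with open(path, "rb") as f:
        return pickle.load(f)


# In this notebook the helpers could be used as follows (file name is illustrative):
# save_bundle('twitter_sentiment_bundle.pkl', vectorizer=vectorizer, model=model)
# bundle = load_bundle('twitter_sentiment_bundle.pkl')
# features = bundle['vectorizer'].transform([stemming_function('what a great day')])
# bundle['model'].predict(features)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.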

In [35]:
from sklearn.naive_bayes import MultinomialNB

# Train Naive Bayes model
naive_bayes_model = MultinomialNB()
naive_bayes_model.fit(X_train, Y_train)

In [36]:
# Make predictions and evaluate the model
X_train_prediction_nb = naive_bayes_model.predict(X_train)
training_data_accuracy_nb = accuracy_score(Y_train, X_train_prediction_nb)
print('Accuracy score on training data (Naive Bayes) : ', training_data_accuracy_nb)

X_test_prediction_nb = naive_bayes_model.predict(X_test)
test_data_accuracy_nb = accuracy_score(Y_test, X_test_prediction_nb)
print('Accuracy score on test data (Naive Bayes) : ', test_data_accuracy_nb)

Accuracy score on training data (Naive Bayes) :  0.7540830357142857
Accuracy score on test data (Naive Bayes) :  0.7521479166666667


In [38]:
import pickle

filename_nb = 'twitter_data_naive_bayes_model.pkl'
# A context manager closes the file even if pickling fails
with open(filename_nb, 'wb') as f:
    pickle.dump(naive_bayes_model, f)