# Data Preprocessing Tools

## Importing Dataset

I will use the data previously scraped from Twitter (stored in my [Datasets Notion page](https://florentine-rayon-d99.notion.site/Datasets-88840ad9026047d09c0359327f39efd0)).

> 💡 See the **Obtaining Data** Notebook.

In [4]:
import numpy as np
import pandas as pd
import os

data_tweets = pd.read_csv("https://s3.us-west-2.amazonaws.com/secure.notion-static.com/2082ae9c-8b63-420c-9247-ebe388e3f92c/ML-AZ-tweets.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20230119%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20230119T063507Z&X-Amz-Expires=86400&X-Amz-Signature=5d41d0c0803a7609fc64800131bd9f477b576d22ee4918d881fa6e1242b3091e&X-Amz-SignedHeaders=host&response-content-disposition=filename%3D%22ML-AZ-tweets.csv%22&x-id=GetObject")

In [5]:
data_tweets.head()

Unnamed: 0,date,id,content,sentiment,has_media,views,retweets,replies,user,followers,likes
0,2023-01-19 02:55:22+00:00,1615905729654702083,Accurate.,0.4,False,3828.0,1,1,tunguz,90940,15
1,2023-01-19 02:21:00+00:00,1615897079741571072,We are so early. https://t.co/jKdSMlpPMS,0.1,False,3879.0,1,4,tunguz,90940,35
2,2023-01-19 00:01:10+00:00,1615861890739220481,Why cool and waste it when you can boil and ta...,0.075,True,5703.0,6,9,tunguz,90940,56
3,2023-01-18 23:46:44+00:00,1615858256471273472,I barely know what a binary tree is. Is that l...,0.05,False,8774.0,0,11,tunguz,90940,46
4,2023-01-18 23:43:05+00:00,1615857338254241792,"Yes, gaslighting is the right term here.",0.285714,False,4754.0,2,2,tunguz,90940,12


## Data Selection

To translate this dataset to a regression problem, I will use the:
- Tweet Sentiment
- If the Tweet contains media
- Number of Views
- User
- Number of Followers of the User

And I will see if I can predict the **Number of Likes** a Tweet may have. 

> In this notebook I won't use ML, I will just preprocess the data.

In [6]:
data_tweets_selected = data_tweets[["sentiment", "has_media", "views", "user", "followers", "likes"]]

In [8]:
X = data_tweets_selected.iloc[:, :-1].values
y = data_tweets_selected.iloc[:, -1].values

In [9]:
print(X)

[[0.4000000000000001 False 3828.0 'tunguz' 90940]
 [0.1 False 3879.0 'tunguz' 90940]
 [0.0749999999999999 True 5703.0 'tunguz' 90940]
 ...
 [0.3 False 24248.0 'svpino' 228565]
 [0.3333333333333333 True 40529.0 'svpino' 228565]
 [-0.25 True 52537.0 'svpino' 228565]]


In [10]:
print(y)

[   15    35    56    46    12    30    37    40    54   165    31    21
    37    83   343    52    54   104    13    69    26     5    22    41
    57   972    66    32    53    81    12    81    18    54    26    10
    23    16    16   152    40    40    75    44    34    50   137   119
  1774    68    10    27    40    20    72    75    65   148    66    11
    22   174    96     2     4    28    30     7    87    44    42    20
    88    13    83    25   134    19   134    39    25     8    52    53
    22     4    28    24    17    27    81    89   420   311    94    10
   398   181    43   464    69   223    20   120     1     0     2     1
     3     3     3     6     8     1    11     2    14     2     1     9
     2     5     1     2     1    10     1     0     0     1     0     2
     1     1     3     2     0     0     0     1     4     1     4     2
     2     2     3     0     2     1     0     0     1     2     0     0
   717  1273    62  6653  1150  2195  1314    73   

## Taking care of missing data

If there is any - for this example there is not. But I will do the exercise for number of views and sentiment.

In [15]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer() # Replace nan values with mean (default simple imputer parameters)
imputer.fit(X[:, [0,2]])

In [17]:
X[:, [0,2]] = imputer.transform(X[:, [0,2]])

In [18]:
print(X)

[[0.4000000000000001 False 3828.0 'tunguz' 90940.0]
 [0.1 False 3879.0 'tunguz' 90940.0]
 [0.0749999999999999 True 5703.0 'tunguz' 90940.0]
 ...
 [0.3 False 24248.0 'svpino' 228565.0]
 [0.3333333333333333 True 40529.0 'svpino' 228565.0]
 [-0.25 True 52537.0 'svpino' 228565.0]]


## Encoding categorical data

### Encoding the Independent Variable

The categorical data I have is **User** and **if the tweet contains media**(binary variable).

In [31]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers = [('encoder',OneHotEncoder(drop="first"),[1,3])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X)) # First columns will be the encoded columns

In [34]:
print(X)

[[0.0 0.0 0.0 ... 0.4000000000000001 3828.0 90940.0]
 [0.0 0.0 0.0 ... 0.1 3879.0 90940.0]
 [1.0 0.0 0.0 ... 0.0749999999999999 5703.0 90940.0]
 ...
 [0.0 0.0 1.0 ... 0.3 24248.0 228565.0]
 [1.0 0.0 1.0 ... 0.3333333333333333 40529.0 228565.0]
 [1.0 0.0 1.0 ... -0.25 52537.0 228565.0]]


### (not used) Encoding the Dependent Variable

We won't encode the dependent variable as we will be working with regression (it is numeric).

In [35]:
# from sklearn.preprocessing import LabelEncoder

# le = LabelEncoder()
# y = le.fit_transform(y)

## Splitting the dataset into the Training set and Test set

In [36]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

## Feature Scaling

We need to scale the **sentiment**, the **views**, and the **number of followers** (they became the last columns).

In [42]:
len(X[0])

7

In [43]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_test[:, 5:8] = scaler.fit_transform(X_test[:, 5:8])
X_train[:, 5:8] = scaler.transform(X_train[:, 5:8])

In [45]:
print(X_train[0])

[0.0 0.0 0.0 0.0 0.2121212121212121 -0.7109035951572351
 -1.0364130959482605]
