## Data Cleaning

### Import Data

In [23]:
import pandas as pd

df_ads = pd.read_csv('ads_nonads.csv')

# Keep only the text and label columns (SQL equivalent: SELECT cc_text, ad FROM ads_nonads)
df_ads = df_ads[["cc_text", "ad"]]


In [24]:
print(df_ads.head())
print(df_ads.shape)
print(df_ads.info())

                                             cc_text ad
0  cold-like symptoms, they will just give you an...  0
1  Thirty percent of the world's <span class="hig...  0
2  the last few years has been really shocking an...  0
3  in. when the tree is harvested and turned into...  0
4  >> as a teacher, the one that scared me the mo...  0
(167145, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167145 entries, 0 to 167144
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   cc_text  167143 non-null  object
 1   ad       167143 non-null  object
dtypes: object(2)
memory usage: 2.6+ MB
None
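
The `info()` output reports `ad` as `object` rather than an integer dtype, which suggests the column mixes value types (for example, integers alongside strings). One quick way to see what it contains before filtering is to count the Python types and the raw values. The sketch below uses a small made-up frame, `df_demo`, as a stand-in for `df_ads`; its values are illustrative assumptions, not taken from the real file.

```python
import pandas as pd

# Hypothetical stand-in for df_ads: an object column mixing ints, strings and a missing value
df_demo = pd.DataFrame({
    "cc_text": ["a", "b", "c", "d", "e"],
    "ad": pd.Series([0, 1, "1", "ad", None], dtype=object),
})

# Count the Python type of each non-missing label
type_counts = df_demo["ad"].dropna().map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Count each raw value, including missing ones
print(df_demo["ad"].value_counts(dropna=False))
```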


### Cleaning

In [25]:
# drop rows with any missing values
df_ads = df_ads.dropna()
print(df_ads.shape)
# drop duplicate rows
df_ads = df_ads.drop_duplicates()
print(df_ads.shape)
# keep only rows whose 'ad' label is the integer 0 or 1
# (.copy() avoids a SettingWithCopyWarning in the conversions below)
df_ads = df_ads[df_ads['ad'].isin([0, 1])].copy()
print(df_ads.shape)

# Convert 'cc_text' column to string
df_ads['cc_text'] = df_ads['cc_text'].astype(str)

# Convert 'ad' column to integer
df_ads['ad'] = df_ads['ad'].astype(int)

(167141, 2)
(150667, 2)
(135566, 2)
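
Because `ad` was read as `object`, `isin([0, 1])` keeps only values that compare equal to the integers 0 and 1; a label stored as the string `'1'` would be dropped. Whether that explains the roughly 15,000 rows removed in the last step cannot be determined from the output here. If string labels should count as valid, one option is to coerce the column to numeric before filtering. A minimal sketch on a made-up frame (`df_demo` and its values are assumptions for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "cc_text": ["a", "b", "c", "d", "e"],
    "ad": pd.Series([0, 1, "1", "0", "ad"], dtype=object),
})

# Integer-only filter, as in the cell above: string labels do not match
strict = df_demo[df_demo["ad"].isin([0, 1])]

# Coerce to numeric first; anything unparseable becomes NaN and is filtered out
labels = pd.to_numeric(df_demo["ad"], errors="coerce")
lenient = df_demo[labels.isin([0, 1])].copy()
lenient["ad"] = labels[labels.isin([0, 1])].astype(int)

print(len(strict), len(lenient))
```

Comparing the two row counts on the real data would show how many rows are affected by the choice.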


In [27]:
# data check after cleaning
print(df_ads["ad"].value_counts())

ad
1    77795
0    57771
Name: count, dtype: int64


In [31]:
# print out the head of the data where ad is 1
print(df_ads[df_ads["ad"] == 1].head())

# print out the head of the data where ad is 0
print(df_ads[df_ads["ad"] == 0].head())

                                              cc_text  ad
11  I've lost count of how many asthma attacks I'v...   1
13  headache, and injection reactions. Ready for a...   1
14  ? One SodaStream bottle can save... i don't kn...   1
19  ? Hello, Sharks. My name is Tracy Rosensteel, ...   1
20  And Joe Biden's weakness makes it even worse. ...   1
                                             cc_text  ad
0  cold-like symptoms, they will just give you an...   0
1  Thirty percent of the world's <span class="hig...   0
2  the last few years has been really shocking an...   0
3  in. when the tree is harvested and turned into...   0
4  >> as a teacher, the one that scared me the mo...   0


#### The dataset is now reasonably balanced (about 57% ads and 43% non-ads), so we move on to text cleaning

In [32]:
df_ads["cc_text"][1]

'Thirty percent of the world\'s <span class="highlight">oceans</span> and and by twenty thirty introducing controls on invasive species and <span class="highlight">reducing</span> <span class="highlight">plastic</span> <span class="highlight">pollution</span> officials insist they can agree ambitious plans to transform our relationship with bio divest E. And ensure that by twenty fifty we obtained a shed vision of living in harmony with nature it all sounds good in theory doesn\'t it but just how realistic other goals lost here the world was stunned when united nations reported the world leaders had failed to meet a single bio diverse city target agreed unite she in twenty ten and I just can\'t be acceptable this time around the scientists claim that now humans are causing the six mass extinction event in the history of <span class="highlight">planet</span> and twenty twenty world economic forum business leaders said by divest you dos was the third biggest risk to the well in terms of 

In [33]:
df_ads["cc_text"][4]

'>> as a teacher, the one that scared me the most. affected their ability to learn. >> inexpensive and effective <span class="highlight">pesticide</span> considered essential by the industry. widely used on farms in 45 states to grow food, wheat every day. ingested the residue and food, <span class="highlight">water</span> <span class="highlight">contamination</span> and when it is sprayed and fields and trees and then carried in the wind. >> the draft that happens goes into their mouth and bodies. >> sarah says in many areas almost no buffer of protection between fields, schools and homes. >> the risk of <span class="highlight">pesticides</span> are too much.> studies show it can impact brain development in children. lower iqs, disabilities, disorders like adhd. >> very very <span class="highlight">toxic</span> <span class="highlight">chemical</span>. very concerned about it. >> contacted lobbying groups, written in support, as well as'

In [42]:
import re

def clean_text(text):
    # Remove HTML tags (the text between the tags is kept)
    text = re.sub(r'<.*?>', '', text)
    # Remove every character that is not a letter, digit, or whitespace
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    # Convert text to lowercase
    text = text.lower()
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text
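
Note that deleting tags and punctuation outright can fuse neighbouring words: `cold-like` becomes `coldlike`, and text separated only by a tag would be joined. If that matters for the downstream model, a variant that substitutes a space instead is sketched below. The name `clean_text_spaced` is hypothetical, and treating apostrophes differently from other punctuation is a design assumption, not something this notebook does.

```python
import re

def clean_text_spaced(text):
    # Replace HTML tags with a space so words on either side stay separate
    text = re.sub(r'<[^>]*>', ' ', text)
    # Drop apostrophes so contractions stay one token (doesn't -> doesnt)
    text = text.replace("'", "")
    # Replace any other non-alphanumeric, non-whitespace character with a space
    text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)
    # Lowercase, then collapse runs of whitespace
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text_spaced('Cold-like <span class="highlight">symptoms</span>, doesn\'t it?'))
```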


In [44]:
sample_1 = clean_text(df_ads["cc_text"][1])
print(sample_1)

thirty percent of the worlds oceans and and by twenty thirty introducing controls on invasive species and reducing plastic pollution officials insist they can agree ambitious plans to transform our relationship with bio divest e and ensure that by twenty fifty we obtained a shed vision of living in harmony with nature it all sounds good in theory doesnt it but just how realistic other goals lost here the world was stunned when united nations reported the world leaders had failed to meet a single bio diverse city target agreed unite she in twenty ten and i just cant be acceptable this time around the scientists claim that now humans are causing the six mass extinction event in the history of planet and twenty twenty world economic forum business leaders said by divest you dos was the third biggest risk to the well in terms of likelihood and severity behind only climate change failure and weapons


In [45]:
sample_2 = clean_text(df_ads["cc_text"][4])
print(sample_2)

as a teacher the one that scared me the most affected their ability to learn inexpensive and effective pesticide considered essential by the industry widely used on farms in 45 states to grow food wheat every day ingested the residue and food water contamination and when it is sprayed and fields and trees and then carried in the wind the draft that happens goes into their mouth and bodies sarah says in many areas almost no buffer of protection between fields schools and homes the risk of pesticides are too much studies show it can impact brain development in children lower iqs disabilities disorders like adhd very very toxic chemical very concerned about it contacted lobbying groups written in support as well as


In [46]:
# Apply clean_text function to cc_text column
df_ads["cc_text"] = df_ads["cc_text"].apply(clean_text)

In [47]:
df_ads["cc_text"].head()

0    coldlike symptoms they will just give you an a...
1    thirty percent of the worlds oceans and and by...
2    the last few years has been really shocking an...
3    in when the tree is harvested and turned into ...
4    as a teacher the one that scared me the most a...
Name: cc_text, dtype: object
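
Cleaning can create problems that the earlier `dropna()` and `drop_duplicates()` calls no longer catch: a caption consisting only of tags or punctuation becomes an empty string, and captions that differed only in markup or punctuation become identical. Identical rows that end up on both sides of a train-test split would inflate the measured performance. A possible follow-up check, sketched on a made-up frame (`df_demo` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical cleaned data: one empty text and one duplicate pair
df_demo = pd.DataFrame({
    "cc_text": ["buy now", "buy now", "", "weather update"],
    "ad": [1, 1, 0, 0],
})

# Drop rows whose text is empty after cleaning
df_demo = df_demo[df_demo["cc_text"].str.len() > 0]
# Drop rows that became exact duplicates after cleaning
df_demo = df_demo.drop_duplicates().reset_index(drop=True)

print(df_demo)
```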

### Train-test split

In [48]:
from sklearn.model_selection import train_test_split

# Split the data into features and target
X = df_ads["cc_text"]
y = df_ads["ad"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting datasets to verify
print("Training set size:", X_train.shape, y_train.shape)
print("Testing set size:", X_test.shape, y_test.shape)

Training set size: (108452,) (108452,)
Testing set size: (27114,) (27114,)


### Now you can use the training dataset to build your model and the test dataset to evaluate its performance. If you would like an additional validation dataset, you can split the training dataset further.
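
As one way to do that further split, the sketch below carves a validation set out of the training portion with a second call to `train_test_split`. The toy data, the 25% validation share, and the use of `stratify` to preserve the class ratio in each subset are all assumptions for illustration; the split above does not stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y (illustrative only)
X = pd.Series([f"text {i}" for i in range(100)])
y = pd.Series([1] * 60 + [0] * 40)

# First split: 80% train, 20% test, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: take 25% of the training part as validation (20% of the full data)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train
)

print(len(X_train), len(X_val), len(X_test))
```

The validation set can then be used for model selection and tuning, leaving the test set untouched until the final evaluation.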