#Importing Warnings to filter out unnecessary warning messages
#Also Importing:

1. Pandas - Data Manipulation
2. NumPy - Numerical Operations
3. re - Regex Operations
4. NLTK (stopwords, word_tokenize) - Stop Word Removal and Tokenization
5. TfidfVectorizer - To Vectorize the Input
6. XGBClassifier
7. Models for Machine Learning and Prediction:

  * Logistic Regression
  * SVC
  * BernoulliNB

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import re,string
from nltk.corpus import stopwords
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn import svm
from sklearn.metrics import accuracy_score
from scipy import stats
from sklearn.metrics import f1_score

##Data Cleaning

In [None]:
def strip_links(text):
    # Raw string avoids invalid-escape warnings; the pattern matches the same text as before
    link_regex = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')    
    return text
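
As a quick illustration of the link-stripping step, here is a minimal sketch. It uses a deliberately simplified pattern (`https?://\S+`) and a hypothetical helper name `strip_links_simple`; both are assumptions for illustration and not the regex used above, which is broader.

```python
import re

# Simplified URL pattern (an assumption for illustration; the notebook's
# own regex is broader and also accepts backslash-separated links).
SIMPLE_LINK_RE = re.compile(r"https?://\S+")

def strip_links_simple(text):
    """Replace every http(s) URL in `text` with ', '."""
    return SIMPLE_LINK_RE.sub(", ", text)

sample = "Read more at https://example.com/post?id=1 today"
stripped = strip_links_simple(sample)
print(stripped)
```

Like `strip_links`, the sketch leaves a `', '` placeholder where the URL was, so the surrounding words do not get glued together.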

##Removing Symbols and Stop Words

In [None]:
def strip_all_entities(text):
    entity_prefixes = ['@','#']
    for separator in  string.punctuation:
        if separator not in entity_prefixes :
            text = text.replace(separator,' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                words.append(word)
    return ' '.join(words)

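def stop_rev(text):
    """Remove English stop words and tokens shorter than 3 characters."""
    result = []
    tokens = word_tokenize(text)
    stop = set(stopwords.words('english'))
    for word in tokens:
        # The comparison is case-sensitive: the stop list is lowercase
        if word not in stop and len(word) >= 3:
            result.append(word)
    return " ".join(result)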
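
The stop-word check in `stop_rev` is case-sensitive and NLTK's English list is lowercase, so capitalized stop words such as "The" or "What" survive when the filter runs before lowercasing. The sketch below shows the effect under two stated assumptions: a tiny hand-written stop list stands in for the NLTK list, and `str.split` stands in for `word_tokenize`. The helper name `filter_tokens` is hypothetical.

```python
# Tiny hand-written stop list standing in for NLTK's English list
# (an assumption so the sketch runs without downloading NLTK data).
STOP = {"the", "what", "was", "and"}

def filter_tokens(text, lowercase_first=False):
    """Drop stop words and tokens shorter than 3 characters."""
    if lowercase_first:
        text = text.lower()
    # str.split stands in for nltk.word_tokenize here
    return " ".join(t for t in text.split() if t not in STOP and len(t) >= 3)

sample = "The last days was an emotional rollercoaster"
as_is = filter_tokens(sample)                         # "The" is kept
lowered = filter_tokens(sample, lowercase_first=True)  # "the" is removed
print(as_is)
print(lowered)
```

If capitalized stop words should also be removed, lowercasing the text before filtering is one option.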

##Reading Input File

In [None]:
df = pd.read_excel('/content/BLOG GENDER BALANCED.xlsx')
print(df.head())

                                                BLOG GENDER
0   Beyond Getting There: What Travel Days Show U...      F
1  I remember so much about the island; the large...      F
2  I have had asthma and allergies my entire life...      M
3  The last few days have been an emotional rolle...      M
4  If you lined up all the teachers and staff in ...      F


In [None]:
df.columns

Index(['BLOG', 'GENDER'], dtype='object')

##Dropping NA fields

In [None]:
df.dropna(inplace=True)
df

Unnamed: 0,BLOG,GENDER
0,Beyond Getting There: What Travel Days Show U...,F
1,I remember so much about the island; the large...,F
2,I have had asthma and allergies my entire life...,M
3,The last few days have been an emotional rolle...,M
4,If you lined up all the teachers and staff in ...,F
...,...,...
2595,Activists help put an end to gross negligence...,F
2596,"I live to bash Al-Farouq Aminu, so bash him I ...",M
2597,so i havent posted anything in a couple of day...,M
2598,Hey. Things are going great down here in alab...,M
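
A small sketch of what `dropna` does to the row labels, using a made-up four-row frame (`df_demo` and its contents are assumptions for illustration). Rows containing a missing value are removed, and the surviving rows keep their original index labels instead of being renumbered.

```python
import pandas as pd

# Made-up data: row 1 has a missing blog text
df_demo = pd.DataFrame({
    "BLOG": ["first post", None, "third post", "fourth post"],
    "GENDER": ["F", "M", "M", "F"],
})

cleaned = df_demo.dropna()  # returns a new frame without rows that contain NaN/None
dropped = len(df_demo) - len(cleaned)

print(cleaned.index.tolist())
print(dropped, "row(s) dropped")
```

When contiguous labels are needed afterwards, `cleaned.reset_index(drop=True)` renumbers the rows from zero.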


In [None]:
blogs = df["BLOG"]
gender = df["GENDER"]

##Downloading NLTK Data and Cleaning the Blogs

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
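clean_news = []
for blog in blogs:
    # Replace @mentions, non-alphanumeric characters and URLs with spaces, then collapse whitespace
    text = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", blog).split())
    # Strip any remaining links and entities, then drop stop words and short tokens
    clean_news.append(stop_rev(strip_all_entities(strip_links(text))))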
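
To see what the `re.sub` step does on its own, the sketch below applies the same three alternatives to a made-up sentence. The helper name `basic_clean` and the sample text are assumptions for illustration.

```python
import re

# Same three alternatives as in the cleaning cell: @mentions, any character
# that is not an ASCII letter, digit, space or tab, and scheme://... URLs.
CLEAN_RE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")

def basic_clean(text):
    """Replace matches with spaces, then collapse runs of whitespace."""
    return " ".join(CLEAN_RE.sub(" ", text).split())

sample = "Today’s post by @gillian: see http://one-giant-step.com now!"
cleaned_sample = basic_clean(sample)
print(cleaned_sample)
```

The curly apostrophe is replaced by a space, which leaves a stray `s` behind; the later length filter (tokens of at least 3 characters) removes such fragments.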

In [None]:
df_new = pd.DataFrame({"blogs":clean_news, "gender":gender})
df_new.head()

Unnamed: 0,blogs,gender
0,Beyond Getting There What Travel Days Show Tod...,F
1,remember much island large Lighthouse helped f...,F
2,asthma allergies entire life While bet many as...,M
3,The last days emotional rollercoaster team beh...,M
4,lined teachers staff school search might possi...,F


Old Input

In [None]:
print(df['BLOG'][0])
print(len(df['BLOG'][0]))

 Beyond Getting There: What Travel Days Show Us

Today’s guest post is by Gillian at One-Giant-Step.com sums up for me that imperceptible change that happens when you travel… you start appreciating things you never thought you would.  In that process, maybe you even learn a new way to see the world.



Who is it that said “It’s not about the destination, it’s about the journey”? Nine months of full time traveling has proven to me that this is absolutely true.

Before leaving on this trip the thought of an 8 or 10 hour bus trip was pretty daunting. The longest trips we’d taken were on planes, where they serve drinks and meals and we can pass the time watching movies. Eight hours on a bus, without the same amenities sounded like torture but we jumped in right from the start with a 22 hour ride from Lima to Cusco that, while not the most comfortable ride, got us into the swing of things pretty quickly.

Once we got a routine down…snacks packed, books prepared, podcasts ready…and had deter

New Cleaned Tokenized Input

In [None]:
print(df_new['blogs'][0])
print(len(df_new['blogs'][0]))

Beyond Getting There What Travel Days Show Today guest post Gillian One Giant Step com sums imperceptible change happens travel start appreciating things never thought would process maybe even learn new way see world Who said destination journey Nine months full time traveling proven absolutely true Before leaving trip thought hour bus trip pretty daunting The longest trips taken planes serve drinks meals pass time watching movies Eight hours bus without amenities sounded like torture jumped right start hour ride Lima Cusco comfortable ride got swing things pretty quickly Once got routine snacks packed books prepared podcasts ready determined favorite seats drivers side window bar blocking view children nearby bus journeys became easy travel days favorite days They chance quiet reflection leave behind past think place leaving start thinking remember experiences great evening market interesting people met cooking course horrible bed guesthouse They chance look forward anticipate coming 

Converting To Lower

In [None]:
df_new['blogs'] = df_new['blogs'].str.lower()

In [None]:
print(df_new['blogs'][0])
print(len(df_new['blogs'][0]))

beyond getting there what travel days show today guest post gillian one giant step com sums imperceptible change happens travel start appreciating things never thought would process maybe even learn new way see world who said destination journey nine months full time traveling proven absolutely true before leaving trip thought hour bus trip pretty daunting the longest trips taken planes serve drinks meals pass time watching movies eight hours bus without amenities sounded like torture jumped right start hour ride lima cusco comfortable ride got swing things pretty quickly once got routine snacks packed books prepared podcasts ready determined favorite seats drivers side window bar blocking view children nearby bus journeys became easy travel days favorite days they chance quiet reflection leave behind past think place leaving start thinking remember experiences great evening market interesting people met cooking course horrible bed guesthouse they chance look forward anticipate coming 

##Machine Learning Algorithm Begins Here:

In [None]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(df_new['blogs'], gender, test_size=0.1, random_state=42)

vectorizer = TfidfVectorizer()
x_train_vect = vectorizer.fit_transform(Xtrain)
x_test_vect = vectorizer.transform(Xtest)
x_train_vect.shape  # not displayed: a cell only shows its last expression
x_test_vect.shape

(260, 45285)
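
The test matrix has as many columns as the training vocabulary because the vectorizer is fitted on the training text only and then reused. A minimal sketch with made-up toy documents (the corpora and variable names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, made up for illustration
train_docs = ["travel days bus trip", "asthma allergies life", "travel guest post"]
test_docs = ["bus travel podcast"]  # "podcast" never appears in the training text

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them; unseen words are ignored

print(train_matrix.shape, test_matrix.shape)
print("podcast" in vec.vocabulary_)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the columns would no longer line up with what the models were trained on.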

In [None]:
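def mod_runner(model, x_train_vect, Ytrain, Ytest):
  # Fit the model on the training vectors and report accuracy on the test set.
  # Note: x_test_vect is read from the notebook's global scope.
  model.fit(x_train_vect, Ytrain)
  best_preds = model.predict(x_test_vect)
  print("Accuracy:{} for Model:{}".format(accuracy_score(Ytest, best_preds), model))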
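
One possible variation, sketched under assumptions: a helper that receives the test matrix as an explicit argument and returns the score, so it does not depend on a global variable. The name `evaluate_model` and the toy numeric data are hypothetical and not part of the notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return its accuracy on the given test data."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return accuracy_score(y_test, predictions)  # argument order: y_true, y_pred

# Toy, clearly separated data (made up for illustration)
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = ["F", "F", "M", "M"]
X_test = [[0.0], [3.0]]
y_test = ["F", "M"]

acc = evaluate_model(LogisticRegression(), X_train, y_train, X_test, y_test)
print("Accuracy:", acc)
```

Returning the score instead of printing it also makes it easy to collect results for several models in a list or dictionary.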


##Testing the Accuracy of All Models

In [None]:
models = [
    BernoulliNB(),
    svm.SVC(kernel='linear', C=1.0),
    LogisticRegression(),
    svm.SVC(kernel='poly', C=1.0),
    svm.SVC(kernel='sigmoid', C=1.0),
]

for model in models:
  mod_runner(model, x_train_vect, Ytrain, Ytest)

Accuracy:0.6846153846153846 for Model:BernoulliNB()
Accuracy:0.7038461538461539 for Model:SVC(kernel='linear')
Accuracy:0.7230769230769231 for Model:LogisticRegression()
Accuracy:0.6346153846153846 for Model:SVC(kernel='poly')
Accuracy:0.7 for Model:SVC(kernel='sigmoid')
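
To put these accuracies in perspective, they can be converted back into counts of correctly classified posts. The sketch below assumes the test set has 260 rows, taken from the `(260, 45285)` shape shown earlier; the dictionary keys are shorthand labels.

```python
# Accuracies as printed above, keyed by a short model label
accuracies = {
    "BernoulliNB": 0.6846153846153846,
    "SVC(linear)": 0.7038461538461539,
    "LogisticRegression": 0.7230769230769231,
    "SVC(poly)": 0.6346153846153846,
    "SVC(sigmoid)": 0.7,
}
n_test = 260  # assumed number of test rows, from the shape printed earlier

# accuracy = correct / n_test, so correct = accuracy * n_test
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}
for name, hits in sorted(correct.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {hits}/{n_test} correct")
```

Logistic Regression comes out on top, but its lead over the linear SVC amounts to only 5 posts out of 260, so the ranking could change with a different random split.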


##Selected Model-Logistic Regression

Taking the 1st entry of the test set to classify and predict.

In [None]:
mod = LogisticRegression()
mod.fit(x_train_vect, Ytrain)
# Index 0 is the first test row, the same row that head(1) displays further down
best_prediction = mod.predict(x_test_vect[0])

The predicted label, shown below, is "M" (male).

In [None]:
print(best_prediction)

['M']


Converting the test inputs and labels into DataFrames to view the actual label.

In [None]:
todf=pd.DataFrame(Xtest)
output = pd.DataFrame(Ytest)

The actual label for the first test entry is also "M" (male).

In [None]:
print(todf.head(1),output.head(1))

                                                  blogs
1593  came finally ere write testi bout col wer star...      GENDER
1593      M
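
When comparing predictions with actual labels, keep in mind that the number 1593 above is an index label carried over from the original DataFrame, not a position. Row `i` of the test matrix corresponds to position `i` of `Ytest`. The sketch below uses made-up labels and predictions (only the index label 1593 comes from the output above) to line the two up by position.

```python
import pandas as pd

# Toy stand-in for Ytest after a shuffled split: the index keeps the
# original row labels (values are made up, apart from the label 1593).
y_test = pd.Series(["M", "F", "M"], index=[1593, 12, 877], name="GENDER")
predictions = ["M", "F", "F"]  # hypothetical model output, one per test row

# Align by position: predictions[i] belongs to y_test.iloc[i]
comparison = pd.DataFrame(
    {"actual": y_test.to_numpy(), "predicted": predictions},
    index=y_test.index,
)
comparison["match"] = comparison["actual"] == comparison["predicted"]

print(comparison)
print(y_test.iloc[0])  # first test row by position
```

Using `.iloc` for positional access avoids accidentally looking up a row by its original label.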


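#The prediction matches the actual label

A single correct prediction does not by itself show that the model is accurate: on the full test set, Logistic Regression reached an accuracy of about 0.72.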