**StackOverflow Post Outcome Prediction Tool**

train.csv is a dataset acquired from Kaggle containing 45,000 StackOverflow posts. The dataset was collected through an API to automate the process, so there is no need to drop or fill missing data.



In [None]:
import pandas as pd
df = pd.read_csv('/content/sample_data/train.csv')
orig_df = df.copy()  # independent copy, so later changes to df leave the original untouched

In [None]:
print(orig_df)

             Id                                              Title  \
0      34552656             Java: Repeat Task Every Random Seconds   
1      34553034                  Why are Java Optionals immutable?   
2      34553174  Text Overlay Image with Darkened Opacity React...   
3      34553318         Why ternary operator in swift is so picky?   
4      34553755                 hide/show fab with scale animation   
...         ...                                                ...   
44995  60461435  Convert List<String> to string C# - asp.net - ...   
44996  60461754  Does Python execute code from the top or botto...   
44997  60462001               how to change payment date in Azure?   
44998  60465318        how to implement fill in the blank in Swift   
44999  60468018  How can I make a c# application outside of vis...   

                                                    Body  \
0      <p>I'm already familiar with repeating tasks e...   
1      <p>I'd like to understand why Ja

Using the pandas .loc indexer, it is possible to select the Ids of posts within the dataset (.loc slices by label and includes both endpoints, so 1:3 returns three rows). These Ids could be used to add new features, for example estimating the number of 'votes' a question has by adding the missing vote values to the dataset.

In [None]:
ids = df['Id'].loc[1:3]
print(ids)

1    34553034
2    34553174
3    34553318
Name: Id, dtype: int64


In [None]:
import numpy as np
numpy_id = ids.to_numpy()
numpy_id = np.split(numpy_id, 3)
print(numpy_id)

[array([34553034]), array([34553174]), array([34553318])]


In [None]:
df.Body.value_counts(dropna=True)

<p>I'm already familiar with repeating tasks every n seconds by using Java.util.Timer and Java.util.TimerTask. But lets say I want to print "Hello World" to the console every random seconds from 1-5. Unfortunately I'm in a bit of a rush and don't have any code to show so far. Any help would be apriciated.  </p>\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              

In [None]:
print(df)

             Id                                              Title  \
0      34552656             Java: Repeat Task Every Random Seconds   
1      34553034                  Why are Java Optionals immutable?   
2      34553174  Text Overlay Image with Darkened Opacity React...   
3      34553318         Why ternary operator in swift is so picky?   
4      34553755                 hide/show fab with scale animation   
...         ...                                                ...   
44995  60461435  Convert List<String> to string C# - asp.net - ...   
44996  60461754  Does Python execute code from the top or botto...   
44997  60462001               how to change payment date in Azure?   
44998  60465318        how to implement fill in the blank in Swift   
44999  60468018  How can I make a c# application outside of vis...   

                                                    Body  \
0      <p>I'm already familiar with repeating tasks e...   
1      <p>I'd like to understand why Ja

LQ_EDIT indicates that the post will only be accepted if adjustments are made to the question. These posts can be dropped in order to create a binary classification of HQ (High Quality) or LQ_CLOSE (where the question is low quality and users are unable to answer).

In [None]:
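# NOTE: drop() returns a new DataFrame; the result is discarded here, so LQ_EDIT rows
# stay in df and become NaN in the map below. Assign it (df = df.drop(...)) to remove them.
df.drop(df[df["Y"].str.contains("LQ_EDIT")].index)
df.Y = df.Y.map({'HQ': 1, 'LQ_CLOSE': 0})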
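
As a minimal sketch of this filtering step, the toy frame below (hypothetical data, not taken from train.csv) shows that `DataFrame.drop` returns a new frame rather than modifying the original, so its result has to be assigned before the labels are mapped. Any label missing from the mapping becomes `NaN`.

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({"Id": [1, 2, 3], "Y": ["HQ", "LQ_EDIT", "LQ_CLOSE"]})

# Without assignment the frame is unchanged, and the unmapped label becomes NaN
toy.drop(toy[toy["Y"] == "LQ_EDIT"].index)
unfiltered = toy["Y"].map({"HQ": 1, "LQ_CLOSE": 0})

# Assigning the result removes the LQ_EDIT rows before mapping
filtered = toy.drop(toy[toy["Y"] == "LQ_EDIT"].index)
binary = filtered["Y"].map({"HQ": 1, "LQ_CLOSE": 0})

print(unfiltered.isna().sum())  # number of rows left without a label
print(binary.tolist())
```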

The Tags and CreationDate columns are also not required, although it should be noted that a question's tags may have some bearing on whether a post is accepted.

In [None]:
df = df.drop(columns=["Tags", "CreationDate"])

In [None]:
print(df)

             Id                                              Title  \
0      34552656             Java: Repeat Task Every Random Seconds   
1      34553034                  Why are Java Optionals immutable?   
2      34553174  Text Overlay Image with Darkened Opacity React...   
3      34553318         Why ternary operator in swift is so picky?   
4      34553755                 hide/show fab with scale animation   
...         ...                                                ...   
44995  60461435  Convert List<String> to string C# - asp.net - ...   
44996  60461754  Does Python execute code from the top or botto...   
44997  60462001               how to change payment date in Azure?   
44998  60465318        how to implement fill in the blank in Swift   
44999  60468018  How can I make a c# application outside of vis...   

                                                    Body    Y  
0      <p>I'm already familiar with repeating tasks e...  0.0  
1      <p>I'd like to understan

The code below splits train.csv into separate CSV files based on the rating value (Y). These may be used for sentiment analysis following the code supplied by scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/main/doc/tutorial/text_analytics/solutions/exercise_02_sentiment.py

In [None]:
from itertools import groupby
import csv

with open('/content/sample_data/train.csv', newline='') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # skip header

    # Sort and group rows by the rating column Y (index 5)
    lst = sorted(reader, key=lambda x: x[5])
    groups = groupby(lst, key=lambda x: x[5])

    # Write one file per rating value, e.g. HQ.csv
    for k, g in groups:
        filename = k + '.csv'
        with open(filename, 'w', newline='') as fout:
            csv_output = csv.writer(fout)
            csv_output.writerow(["Id", "Title", "Body", "Tags", "CreationDate", "Y"])  # header
            for line in g:
                csv_output.writerow(line)
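
The sort before `groupby` in the cell above matters, because `itertools.groupby` only merges rows that are adjacent. A small sketch with hypothetical in-memory rows (same column order as train.csv, with Y at index 5) illustrates the difference:

```python
from itertools import groupby

# Hypothetical rows in the same column order as train.csv; index 5 is Y
rows = [
    ["1", "t1", "b1", "<java>", "2016-01-01", "HQ"],
    ["2", "t2", "b2", "<c#>", "2016-01-02", "LQ_CLOSE"],
    ["3", "t3", "b3", "<python>", "2016-01-03", "HQ"],
]

def label(row):
    return row[5]

# Sorted input: one group per label, each holding the Ids of its rows
grouped = {k: [r[0] for r in g] for k, g in groupby(sorted(rows, key=label), key=label)}

# Unsorted input: the HQ label shows up as two separate groups
unsorted_groups = [k for k, _ in groupby(rows, key=label)]

print(grouped)
print(unsorted_groups)
```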

Stopwords are words deemed insignificant for natural language processing. The code below downloads NLTK's stopwords file and prepares the t variable to strip punctuation. Stripping punctuation may have a negative impact on the accuracy of the model, because punctuation is used differently in computer programming, the main focus of StackOverflow, than in general language.

In [None]:
import nltk
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
import string, re

stemmer = SnowballStemmer('english')
t = str.maketrans(dict.fromkeys(string.punctuation))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
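
To illustrate the concern about punctuation, the sketch below applies the same kind of translation table to one sentence of ordinary prose and one line of code. Both sample strings are made up for the example.

```python
import string

# Same construction as the t variable above: every punctuation character is deleted
table = str.maketrans(dict.fromkeys(string.punctuation))

prose = "Any help would be appreciated, thanks!"
code = "for (int i = 0; i < n; i++) { total += a[i]; }"

clean_prose = prose.translate(table)
clean_code = code.translate(table)

print(clean_prose)
print(clean_code.split())
```

The prose remains readable, while the code loses its operators and brackets, and `a[i]` collapses into the single token `ai`.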


The clean_text function strips punctuation from the Body column of df (train.csv), removes stopwords and stems the remaining words.

In [None]:
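stops = set(stopwords.words("english"))  # build the stopword set once rather than per row

def clean_text(text):
    text = text.translate(t)  # strip punctuation
    words = text.split()
    # drop stopwords and reduce the remaining words to their stems
    words = [stemmer.stem(w) for w in words if w not in stops]
    text = " ".join(words)
    text = re.sub(' +', ' ', text)  # collapse any repeated spaces
    return text

df["Body"] = df["Body"].apply(clean_text)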

When processing language, it is important to condense the data that has been acquired. This may be achieved with CountVectorizer, which turns each document into a vector in which every index represents an individual word and the stored value is that word's frequency. Following on from this, it is possible to store n-grams, which are contiguous sequences of n words (for example pairs or triples of adjacent words).
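
The idea of counting words and n-grams can be sketched with the standard library alone. This is an illustration of the concept, not CountVectorizer's implementation, and the sample sentence and the `ngrams` helper are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "how to repeat a task how to stop a task".split()

unigram_counts = Counter(tokens)            # word -> frequency
bigram_counts = Counter(ngrams(tokens, 2))  # pair of adjacent words -> frequency

print(unigram_counts["task"])
print(bigram_counts["how to"])
```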

TF-IDF (term frequency-inverse document frequency) is a score representing the importance of a term (a word or, in this instance, an n-gram) to a document. TF is the number of times the term occurs in the document, and IDF measures how uncommon the term is across all documents. The two values are multiplied to give the TF-IDF score, which is also used in other applications such as Search Engine Optimization.
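
A minimal sketch of the textbook formula, tf × log(N / df), on a hypothetical three-document corpus. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalises each row by default, so its absolute values will differ from this sketch.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenised
docs = [
    "java timer task".split(),
    "java optional immutable".split(),
    "swift ternary operator".split(),
]

def tf_idf(term, doc, corpus):
    tf = Counter(doc)[term]                    # raw count in this document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    idf = math.log(len(corpus) / df)           # rarer terms get a larger weight
    return tf * idf

common = tf_idf("java", docs[0], docs)   # "java" appears in 2 of 3 documents
rare = tf_idf("timer", docs[0], docs)    # "timer" appears in 1 of 3 documents
print(common < rare)
```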

The max_features parameter limits the vocabulary to the most frequent terms across the corpus, and so controls how many unique features the vectorizer can use. This value is a large part of hyperparameter tuning: it was originally set to 5000, and an increase to 7000 yielded a slight increase in accuracy using Naive Bayes. It may be tweaked further to yield better results.
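
The effect of capping the vocabulary can be sketched as keeping only the k most frequent terms. The corpus and the `top_k_vocabulary` helper below are hypothetical and only mimic the idea behind max_features.

```python
from collections import Counter

docs = ["java timer task", "java optional", "swift timer java"]

def top_k_vocabulary(documents, k):
    # Count every word across the whole corpus, then keep the k most frequent
    counts = Counter(word for doc in documents for word in doc.split())
    return [word for word, _ in counts.most_common(k)]

vocab = top_k_vocabulary(docs, 2)
print(vocab)
```

Words outside the retained vocabulary are simply ignored when a document is vectorized, which is why a larger cap can recover some accuracy at the cost of a wider feature matrix.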

In [None]:
from sklearn import model_selection, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# hold out part of the data for validation (25% by default)
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['Body'], df['Y'])

# fit the label encoder on the training labels only and reuse it for validation
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)

# word counts; fitting on the training split keeps validation text out of the vocabulary
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(train_x)
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)

# word level tf-idf vectorization
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=7000)  # originally 5000
tfidf_vect.fit(train_x)
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)

# n-gram level tf-idf vectorization (bigrams and trigrams)
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=7000)
tfidf_vect_ngram.fit(train_x)
xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)

In [None]:
from sklearn import linear_model, naive_bayes, svm, metrics
from sklearn.metrics import classification_report

target_names = [str(c) for c in encoder.classes_]  # output labels for report generation

def report_generation(classifier, train_data, valid_data, train_y, valid_y):
    # fit on the training split, then score predictions on the validation split
    classifier.fit(train_data, train_y)
    predictions = classifier.predict(valid_data)
    print("Accuracy :", metrics.accuracy_score(valid_y, predictions))
    report = classification_report(valid_y, predictions, output_dict=True, target_names=target_names)
    return report

The two segments of code below use two separate algorithms that may be applied to language processing: Naive Bayes and Support Vector Machines. SVM required too much time to generate its results with a default of 5 epochs, although the time taken was heavily impacted by the number of features.

Therefore, Naive Bayes was chosen as the algorithm for the machine learning model.

In [None]:
# Naive Bayes
clf = naive_bayes.MultinomialNB()
report = report_generation(clf, xtrain_count, xvalid_count, train_y, valid_y)
print("NB Count Vectorizer Report :", report['weighted avg'])

Accuracy : 0.7927111111111111
NB Count Vectorizer Report : {'precision': 0.8164230909178949, 'recall': 0.7927111111111111, 'f1-score': 0.7959311004327146, 'support': 11250}
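
In the output above, the recall in the 'weighted avg' row is identical to the accuracy. This is expected: weighting each class's recall by its support and averaging counts every correct prediction exactly once. A small sketch with made-up labels (the `weighted_recall` helper is hypothetical) shows the equivalence:

```python
from collections import Counter

def weighted_recall(y_true, y_pred):
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n in support.items():
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        total += (hits / n) * n  # per-class recall, weighted by class support
    return total / len(y_true)

# Made-up labels for illustration
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(abs(weighted_recall(y_true, y_pred) - accuracy) < 1e-12)
```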


# Support Vector Machines
# takes too long, had very good results when just using xtrain_count and xvalid_count
classifier = svm.SVC(gamma="scale")
report = report_generation(classifier, xtrain_tfidf_ngram, xvalid_tfidf_ngram, train_y, valid_y)
print("SVM N-gram TF-IDF Report :", report['weighted avg'])

We can then save the trained model as a .pkl file. This avoids having to retrain the model each time we wish to run our code.
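
The save-and-restore round trip can be sketched with a stand-in object. The dictionary below is a hypothetical placeholder for a trained classifier, and the temporary path is chosen only for the example.

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier; any picklable Python object works the same way
model = {"name": "toy-model", "weights": [0.1, 0.2, 0.3]}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

with open(path, "wb") as f:  # pickle files must be opened in binary mode
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code while reconstructing an object.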

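Although no '2' value was ever mapped, it still appears in the predictions of the pickled model. This value most likely corresponds to LQ_EDIT: the drop call above does not assign its result, so the LQ_EDIT rows remain in df, are mapped to NaN, and are then encoded by LabelEncoder as a third class.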

In [None]:
import pickle

# Save the trained model to disk
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)

# Load it back and predict on the validation features without retraining
with open('model.pkl', 'rb') as f:
    pickled_model = pickle.load(f)
xvalid_count = count_vect.transform(valid_x)
pickled_model.predict(xvalid_count)

array([1, 2, 1, ..., 2, 0, 1])

The code below installs the modules needed to run Flask in Google Colab and supplies ngrok with an authtoken so that the resulting webpage can be accessed. The token becomes available after accessing the Flask web server and creating an ngrok account.

In [None]:
!pip install flask-ngrok
!pip install utils
! pip install pyngrok
! ngrok authtoken TOKEN_GOES_HERE

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting flask-ngrok
  Downloading flask_ngrok-0.0.25-py3-none-any.whl (3.1 kB)
Installing collected packages: flask-ngrok
Successfully installed flask-ngrok-0.0.25
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting utils
  Downloading utils-1.0.1-py2.py3-none-any.whl (21 kB)
Installing collected packages: utils
Successfully installed utils-1.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyngrok
  Downloading pyngrok-5.2.1.tar.gz (761 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 761.3/761.3 KB 13.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyngrok
  Building wheel for pyngrok (setup.py) ... done
  Created wheel for pyngrok: filename=py

The following code is primarily Flask. It does not retrain the model: the question body submitted by the user is cleaned with clean_text, vectorized with the count_vect fitted earlier in the notebook, and passed to the pickled model. The predicted value of Y is then used to generate the results page, which shows the user the predicted outcome of their question.

In [None]:
import pickle
from flask import Flask, request, render_template
from flask_ngrok import run_with_ngrok

app = Flask(__name__)
run_with_ngrok(app)  # expose the local server through an ngrok tunnel

# Load the trained classifier once at start-up instead of on every request
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

def ValuePredictor(body):
    # clean_text and count_vect come from the earlier notebook cells; the
    # vectorizer must be the one the model was trained with
    features = count_vect.transform([clean_text(body)])
    return int(model.predict(features)[0])

@app.route("/")
def home():
    return render_template('index.html')

# To use the predict button in our web-app
@app.route('/result', methods=['POST'])
def result():
    body = request.form.get('body', '')
    output = ValuePredictor(body)

    # Labels follow the mapping used in training: 1 = HQ, 0 = LQ_CLOSE;
    # any other value is treated as LQ_EDIT
    if output == 1:
        prediction = 'Question likely to be accepted'
    elif output == 0:
        prediction = 'Question likely to be rejected or closed'
    else:
        prediction = 'Question will be accepted with a few small changes'
    return render_template("result.html", prediction=prediction)


app.run()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


INFO:werkzeug: * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)


 * Running on http://0452-34-74-36-15.ngrok.io
 * Traffic stats available on http://127.0.0.1:4040


Regarding issues that occurred during the creation of this project, SVM taking too long to justify training was a key issue, and more algorithms should have been tested to determine which would yield the greatest accuracy.

Hyperparameter tuning could have been improved through the use of GridSearchCV or RandomizedSearchCV to automatically find the best set of hyperparameters, although this would again take time, which is limited in Google Colab.
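
As a rough sketch of what such a search might look like, the example below wraps a TF-IDF vectorizer and Naive Bayes in a pipeline and searches a small grid. The toy corpus, labels and grid values are assumptions for illustration only, not settings used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; in the notebook this would be the cleaned Body column and Y
texts = [
    "how to repeat a task in java with a timer",
    "why are java optionals immutable",
    "explain ternary operator precedence in swift",
    "how does python execute a script top to bottom",
    "plz fix my code",
    "code not working help",
    "urgent homework do it for me",
    "error help asap",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Illustrative grid only; parameters are addressed as <step name>__<parameter>
param_grid = {
    "tfidf__max_features": [10, 50],
    "nb__alpha": [0.5, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

Putting the vectorizer inside the pipeline means it is refitted on each training fold, so the validation fold never influences the vocabulary.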

One ethical concern is whether StackOverflow users consented to having their questions used in a dataset such as this. As this is secondary data, consent forms were not required, and as the posts are public for anyone to view this is likely a non-issue, but it is something to consider should a project such as this be distributed to the general public.