<a href="https://colab.research.google.com/github/charisliao/crisis-text-line-detection/blob/master/CrisisTextLineDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hello! Welcome to Crsis Text Line Detection Notebook.

### Author: Charis Liao

I aim to build an AI/data system to help identify crisis text lines in the hope the spread mental health awareness. The input text will be ranked between 0 and 1. The closer the text is to 0, the safer (positive) it is , and 1 otherwise.

This notebook handles the back-end AI modeling for Anvil (the front-end interface for crisis text line detection).

For the project, I will:

- Create a working database table that stores user input from user interface using Anvil.     
- create a working GBDT model     
  - Pre-process data (vectorize text to matrix of token counts, model, then post-process data (vectorize text to matrix of token counts; parse model result to pass to Anvil)
- Connect Colab notebook and Anvil app via UpLink

In [1]:
## Load the required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import xgboost

from scipy.sparse import hstack, vstack # use hstack to concatenate features horizontally

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import LabelEncoder

# set option below so Pandas dataframe can output readable text, not truncated
pd.set_option('display.max_colwidth', 0)

# Install the latest version of gdown
!pip install --upgrade --no-cache-dir gdown

# download the file
#!gdown --id 1-3nuUR2kRu7OIIEw2Q_tQ1BGOSZRkkNt
!gdown --id 1lBXN_DLCnWpTOHljHPoPwY_bt3WD2HsR


Collecting gdown
  Downloading gdown-4.7.1-py3-none-any.whl (15 kB)
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.6.6
    Uninstalling gdown-4.6.6:
      Successfully uninstalled gdown-4.6.6
Successfully installed gdown-4.7.1
Downloading...
From: https://drive.google.com/uc?id=1lBXN_DLCnWpTOHljHPoPwY_bt3WD2HsR
To: /content/Tweets.csv
100% 473k/473k [00:00<00:00, 5.09MB/s]


In [2]:
# read datasets
train = pd.read_csv('https://raw.githubusercontent.com/tony51307/datax-gsi/main/train.csv', on_bad_lines='skip',encoding='unicode_escape')
test = pd.read_csv('https://raw.githubusercontent.com/tony51307/datax-gsi/main/test.csv', on_bad_lines='skip')
train.dropna(inplace = True)
x_train, y_train, x_test, y_test = train['text'], train['label'], test['text'], test['label']

In [3]:
# preprocess the data
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)




def first_preprocessor(s):
   #convert to lowercase (which CountVectorizer and TfidfVectorizer do by default)
    s = s.lower()

    # replace "&amp" with "and"
    s = s.replace("&amp", "and")

    # remove select punctation; re refers to the Regular Expression module
    s = re.sub("[@,.!?:;/~*]", " ", s)

    # replace multiple consecutive blank spaces with 1 blank space?
    s = re.sub("[ ]+", " ", s)

    # remove all numbers?
    s = re.sub(r'\d+', '', s)

    return s

# first_preprocessor("10 CONVERT    UPPERCASE TO LOWERCASE AND REMOVE SELECT PUNCUATION?")

ngram_range  = (1,np.random.randint(1, 3))
stop_words   = np.random.choice([None, "english"])

# create an instance of CountVectorizer or TfidfVectorizer
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
vectorizer   = np.random.choice([CountVectorizer(preprocessor=first_preprocessor,
                                                 ngram_range=ngram_range,
                                                 stop_words=stop_words,
                                                 max_features=500),
                                 TfidfVectorizer(preprocessor=first_preprocessor,
                                                 ngram_range=ngram_range,
                                                 stop_words=stop_words,
                                                 max_features=500)
                                ])

# fit_transform a list of sentences into a matrix of token counts (e.g., word counts)
x_train_tokenized = vectorizer.fit_transform(x_train)

# combine the feature names and matrix of 1s and 0s to a dataframe
x_train = pd.DataFrame( x_train_tokenized.toarray(),
                        columns=vectorizer.get_feature_names_out(),
                        index=x_train.index) # take index of original dataframe

# preview the dataframe
print(x_train.head())


x_test_tokenized = vectorizer.transform(x_test)
x_test = pd.DataFrame( x_test_tokenized.toarray(),
                      columns=vectorizer.get_feature_names_out(),
                      index=x_test.index) # take index of original dataframe

# preview the dataframe
print(x_test.head())



(4991,) (4991,)
(999,) (999,)
       able  absolutely  account  actually  advice  afford  afraid  age  ago  \
0  0.000000  0.0         0.0      0.000000  0.0     0.0     0.0     0.0  0.0   
1  0.000000  0.0         0.0      0.165595  0.0     0.0     0.0     0.0  0.0   
2  0.165124  0.0         0.0      0.000000  0.0     0.0     0.0     0.0  0.0   
3  0.000000  0.0         0.0      0.000000  0.0     0.0     0.0     0.0  0.0   
4  0.000000  0.0         0.0      0.000000  0.0     0.0     0.0     0.0  0.0   

   alcohol  ...  wrong   xb  yeah  year  years  yes  yesterday  young  youã  \
0  0.0      ...  0.0    0.0  0.0   0.0   0.0    0.0  0.0        0.0    0.0    
1  0.0      ...  0.0    0.0  0.0   0.0   0.0    0.0  0.0        0.0    0.0    
2  0.0      ...  0.0    0.0  0.0   0.0   0.0    0.0  0.0        0.0    0.0    
3  0.0      ...  0.0    0.0  0.0   0.0   0.0    0.0  0.0        0.0    0.0    
4  0.0      ...  0.0    0.0  0.0   0.0   0.0    0.0  0.0        0.0    0.0    

   âªã  
0  0.

In [4]:
x_train["num_words"]  = np.zeros(4991)
x_test["num_words"]    = np.zeros(999)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(4991, 501) (4991,)
(999, 501) (999,)


In [5]:
# evaluate model, use test dataset to evaluate the model on AUC score


model = xgboost.XGBClassifier(n_estimators=100,
                              max_depth=20,
                              learning_rate=0.1,
                              random_state=0)
model.fit(x_train, y_train)

prediction = model.predict_proba(x_test)
print(prediction)
print()
print("ROC AUC SCORE:", roc_auc_score(y_test, prediction[:,1]))


[[0.93121064 0.06878935]
 [0.3586042  0.6413958 ]
 [0.00773835 0.99226165]
 ...
 [0.00217301 0.997827  ]
 [0.05706757 0.9429324 ]
 [0.1223011  0.8776989 ]]

ROC AUC SCORE: 0.9916310184033323


The ROC AUC SCORE exceeds 0.99.

In [8]:
# connect your model to Anvil


!pip install anvil-uplink
import anvil.server



# Connect to my anvil server
anvil.server.connect("NRJUXU5RGKXINVLJFWJM6ROW-U2XEEGCOY2LICZMV")


@anvil.server.callable
# this section is harder and thus written already for the in-class submission
def sentiment(text):

  # transform the input text to a vector
  text_token = vectorizer.transform([text]).toarray()

  # split the text into number of words
  num_words = len(text.split(' '))

  # append the tokenized words with the number of words
  x = np.append(text_token, num_words)

  # place the dataset in a dataframe
  x = pd.DataFrame([x], columns=x_train.columns)

  # run the predictions on that dataframe
  prediction = model.predict_proba(x)


  # return (label, score)
  if prediction[0,1] >= 0.5:
    label = "POSITIVE"
    score = prediction[0,1]
  else:
    label = "NEGATIVE"
    score = prediction[0,0]

  if label == "NEGATIVE":
    score = prediction[0,1]
  elif label == "POSITIVE":
    score = prediction[0,0]
  return score

  anvil.server.wait_forever()





Collecting argparse (from anvil-uplink)
  Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


Disconnecting from previous connection first...
Connecting to wss://anvil.works/uplink
Anvil websocket closed (code 1000, reason=b'')
Anvil websocket open
Connected to "Default environment" as SERVER


In [7]:
sentiment("hi")

0.055531643