Data Generation

Data was generated in two steps:

1. Initial data was captured internally via a Google Form that asked users about car issues they currently have or have had in the past. We then classified that data into three categories: brakes, starter, and other.

2. We took this 'training set' and used Markovify to generate more data for our tutorial.

In [11]:
!pip install -r requirements.txt



In [2]:
import pandas as pd
df = pd.read_csv('response.csv') 
df = df.fillna('')
df['response']=df.iloc[:,3]+df.iloc[:,5]+df.iloc[:,6]
df['issue'] = df.iloc[:,1]
df['symptom'] = df.iloc[:,2] + df.iloc[:,4]
subset = df.iloc[:,-3:]
subset

Unnamed: 0,response,issue,symptom
0,my brakes make a squeaking noise whenever I tr...,Brakes,Car makes grinding noise
1,super frustrating every time I start my car it...,Starter,Car starts then stops
2,I can't open the damn door to my car,Other,
3,I turn the key and nothing happens,Starter,Car doesn't start
4,Car doesn't always start when it's low on blin...,Starter,Car doesn't start
...,...,...,...
104,Parking brake doesn’t return once released,Brakes,"Car brakes, but then brakes disengage"
105,my lights do not work,Other,
106,I try to start the engine only to find that th...,Starter,Car doesn't start
107,The driver side window auto function does not ...,Other,


In [3]:
import markovify
import codecs

In [4]:
#markovify is a simple, extensible Markov chain generator.
#Its primary use is building Markov models of large text corpora and generating random sentences from them.


#Builds a model from the responses labelled with the given issue (e.g. Brakes, Starter, Other)
def train_markov_type(data, issue):
    return markovify.Text(data[data["issue"] == issue].response, retain_original=False, state_size=2)

#Takes one of the 'issue' models and creates one randomly generated sentence of at most `length` characters (100 by default).
#Returns None when no acceptable sentence could be generated.
def make_sentence(model, length=100):
    return model.make_short_sentence(length, max_overlap_ratio = .7, max_overlap_total=15)

#built models
other_model = train_markov_type(subset, "Other")
brakes_model = train_markov_type(subset, "Brakes")
starter_model = train_markov_type(subset, "Starter")
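
To make `state_size=2` more concrete, here is a minimal pure-Python sketch of a word-level Markov chain in which each state is the previous two words. This is an illustration only, not markovify's implementation; the names `build_chain`, `walk`, `BEGIN`, `END` and the toy corpus are assumptions made up for this example.

```python
import random
from collections import defaultdict

# Hypothetical sentinel tokens marking sentence start and end (illustration only)
BEGIN, END = "<BEGIN>", "<END>"

def build_chain(sentences, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def walk(chain, state_size=2, rng=random):
    """Generate one sentence by following random transitions until END is drawn."""
    state = (BEGIN,) * state_size
    out = []
    while True:
        word = rng.choice(chain[state])
        if word == END:
            return " ".join(out)
        out.append(word)
        state = state[1:] + (word,)

# Toy corpus, loosely modelled on the brake responses above
toy_corpus = [
    "my brakes make a squeaking noise",
    "my brakes make a grinding noise when I stop",
    "my car makes a grinding noise",
]
toy_chain = build_chain(toy_corpus)
print(walk(toy_chain))
```

Because the state "make a" was followed by both "squeaking" and "grinding" in the toy corpus, the walk can splice the sentences together at that point, which is how new but plausible sentences arise.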

In [5]:
make_sentence(other_model)  #creates a sentence that should be an example of 'other' issue(category)

'There is a bit short these days.'

In [6]:
make_sentence(brakes_model)  #creates a sentence that should be an example of 'brakes' issue(category)

'Car takes too long to stop and sometimes very bumpy.'

In [7]:
make_sentence(starter_model)   #creates a sentence that should be an example of 'starter' issue(category)

'When I try to start the engine unless you hit the starter with a hammer.'

We can combine these models with relative weights

In [None]:
#create a compound model in which 'other' is weighted twice as heavily as 'brakes' or 'starter' (weights 14, 7, 7)

#compound_model = markovify.combine([other_model, brakes_model, starter_model], [14, 7, 7])  

In [None]:
#make 20 sentences - copy the text into a spreadsheet and check the count of each issue (e.g. how many brake issues are there?)
#note: this loop samples brakes_model only; substitute compound_model once the cell above is uncommented

for i in range(20):
    print(make_sentence(brakes_model))

In [8]:
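import math
import numpy

def generate_cases(models, weights=None):
    #default to equal weights when none are given
    if weights is None:
        weights = [1] * len(models)
    
    choices = []
    
    total_weight = float(sum(weights))
    
    #pair each model with its cumulative probability threshold
    for i in range(len(weights)):
        choices.append((float(sum(weights[0:i+1])) / total_weight, models[i]))
    
    #draw r in [0, 1) and return the first model whose threshold is >= r
    def choose_model():
        r = numpy.random.uniform()
        for (p, m) in choices:
            if r <= p:
                return m
        return choices[-1][1]


    #pick a uniformly random element of c (currently unused)
    def choose_from(c):
        idx = math.floor(numpy.random.uniform() * len(c))
        return c[idx]
    
    
    #endless generator: each item is a 1-tuple holding one generated sentence
    while True:
        yield (make_sentence(choose_model()), )
        #intended shape: input sentence (my car won't stop), output model type (brakes); only the sentence is yielded so far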
            

In [9]:
#compound_model = markovify.combine([other_model, brakes_model, starter_model], [14, 7, 7])  

t = generate_cases([other_model, brakes_model, starter_model], [3,4,4])  #actual sentences
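
To see what the weights [3,4,4] do inside `generate_cases`, the sketch below recomputes the cumulative thresholds and samples from them. It is a standalone illustration under stated assumptions: `cumulative_thresholds` and `pick` are hypothetical helper names, plain labels stand in for the three models, and the standard library's `random` replaces `numpy.random`.

```python
import bisect
import itertools
import random

def cumulative_thresholds(weights):
    """Turn relative weights into cumulative probabilities ending at 1.0."""
    total = float(sum(weights))
    return [running / total for running in itertools.accumulate(weights)]

def pick(items, thresholds, r):
    """Return the first item whose threshold is >= r, for r in [0, 1)."""
    idx = bisect.bisect_left(thresholds, r)
    return items[min(idx, len(items) - 1)]

labels = ["Other", "Brakes", "Starter"]   # stand-ins for the three models
thresholds = cumulative_thresholds([3, 4, 4])
print(thresholds)  # 3/11, 7/11, 11/11

rng = random.Random(0)
counts = {label: 0 for label in labels}
for _ in range(11000):
    counts[pick(labels, thresholds, rng.random())] += 1
print(counts)  # roughly in a 3:4:4 ratio
```

The standard library offers the same behaviour in one call, `random.choices(items, weights=[3, 4, 4])`, which may be simpler if the hand-rolled thresholds are not needed elsewhere.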


In [10]:


[next(t) for i in range(5)]  #preview 5 generated sentences



[('I tried starting my car it just stops again, what is wrong!',),
 ('I hear a rattling noise when i drive it above 60 mph',),
 ('But the car does not roll back up.',),
 ('The battery is getting old so my range is a lag when I go over a bump.',),
 ('Customer states breaks make noises and also take more time to stop.',)]

Checking for similarity (slow)

In [None]:
#https://stackoverflow.com/questions/54334304/spacy-cant-find-model-en-core-web-sm-on-windows-10-and-python-3-5-3-anacon
#in your terminal window, execute the following code, before loading 'en_core_web_sm':

#     cd vehicle-claims-processing/
#     python -m spacy download en_core_web_sm


In [None]:
#spaCy is a free, open-source library for NLP in Python
#en_core_web_sm is an English pipeline optimized for CPU.  Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer

import spacy

In [None]:
#load english tokenizer, tagger, parser and NER
nlp = spacy.load('en_core_web_sm')  #the nlp is going to tokenize the lists dt_b, dt_a

dt_b = subset["response"]  #108 responses (from our Google Form) in our response.csv
#100 generated sentences drawn from the 3 issue models, which were built from the response.csv issues (categories)
#generate_cases yields 1-tuples, so unpack the sentence with [0]; None results are skipped, so the list may be slightly shorter than 100
dt_a = [s for s in (next(t)[0] for i in range(100)) if s is not None]

import numpy as np
a = []
for sentence in dt_a:
    doc = nlp(sentence)
    m = 0
    for sentence1 in dt_b:
        doc1 = nlp(sentence1)
        score = doc.similarity(doc1)
        if m < score:
            m = score  #m keeps the highest similarity found so far (similarity values lie between -1 and 1)
    a.append(m)  #a[] ends up holding one best-match score per generated sentence
        
print("Mean similarity: " + str(np.array(a).mean()))
print(a)

import seaborn as sns
sns.displot(a)

#plotting generated sentences vs Google Form sentences: for each generated sentence, the highest similarity to any Google Form sentence.
#the distribution looks fairly normal, which suggests (though it doesn't prove) that our NLP generation isn't bad :)
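
The "best match per generated sentence" logic can be tried without spaCy. The sketch below swaps in a simple bag-of-words cosine similarity, which is not what `doc.similarity` computes; it only illustrates the max-over-originals loop. The function names and the two short lists are assumptions for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two sentences (a stand-in for doc.similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def best_match_scores(generated, originals):
    """For each generated sentence, keep its highest similarity to any original sentence."""
    return [max(cosine(g, o) for o in originals) for g in generated]

originals = ["my brakes make a squeaking noise", "I turn the key and nothing happens"]
generated = ["my brakes make a grinding noise", "my lights do not work"]
scores = best_match_scores(generated, originals)
print(scores)
```

A generated sentence that shares most of its words with some original scores close to 1, while one with little overlap scores close to 0, so the histogram of these scores shows how closely the generated data tracks the source responses.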

In [None]:
import cProfile

def timing(c):
    for _ in range(c):
        next(t)

cProfile.run('timing(2000)', 'generatestats')

In [None]:
import pstats
p = pstats.Stats('generatestats')
p.strip_dirs().sort_stats(-1).print_stats()
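
The same profiling workflow can also be run in memory and sorted by cumulative time, which tends to surface the most expensive call chains first. This is a self-contained sketch: `slow_sum` is a hypothetical stand-in for `timing(2000)`, and the report is captured in a string buffer rather than written to a stats file.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Hypothetical workload standing in for timing(2000)."""
    return sum(i * i for i in range(n))

# Profile the call directly instead of passing a source string to cProfile.run
profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Send the report to a buffer, sorted by cumulative time, limited to the top 5 entries
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```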