python -m spacy download en_core_web_sm

Data Generation

Data was generated in two steps:

1. Initial data was captured internally via a Google Form that asked users about car issues they currently have or have had in the past. The responses were then classified into three categories: brakes, starter, and other.

2. This 'training set' was then fed to Markovify to generate more data for our tutorial.

In [3]:
!pip install -r requirements.txt
import pandas as pd
import csv

In [4]:
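df = pd.read_csv('response.csv')
# Replace missing cells with empty strings so the column concatenation below works
df = df.fillna('')
# Free-text answers are spread over columns 3, 5 and 6; join them into one response
df['response'] = df.iloc[:, 3] + df.iloc[:, 5] + df.iloc[:, 6]
df['issue'] = df.iloc[:, 1]
# Symptom answers are spread over columns 2 and 4
df['symptom'] = df.iloc[:, 2] + df.iloc[:, 4]
# Keep only the three columns just added (response, issue, symptom)
subset = df.iloc[:, -3:]
subset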

Unnamed: 0,response,issue,symptom
0,my brakes make a squeaking noise whenever I tr...,Brakes,Car makes grinding noise
1,super frustrating every time I start my car it...,Starter,Car starts then stops
2,I can't open the damn door to my car,Other,
3,I turn the key and nothing happens,Starter,Car doesn't start
4,Car doesn't always start when it's low on blin...,Starter,Car doesn't start
...,...,...,...
104,Parking brake doesn’t return once released,Brakes,"Car brakes, but then brakes disengage"
105,my lights do not work,Other,
106,I try to start the engine only to find that th...,Starter,Car doesn't start
107,The driver side window auto function does not ...,Other,


In [6]:
import markovify
import codecs

In [7]:
#markovify is a simple, extensible Markov chain generator
#Its primary use is for building Markov models of large corpora of text and generating random sentences from that.


#Function builds the model according to what issue (e.g. brakes, starter, other) is given
def train_markov_type(data, issue):
    return markovify.Text(data[data["issue"] == issue].response, retain_original=False, state_size=2)

#Function takes one of the 'issue' models and creates one randomly generated sentence of at most `length` characters (default 100).
#Note: it creates only one sentence per call, and markovify returns None if no sentence passes its overlap checks
def make_sentence(model, length=100):
    return model.make_short_sentence(length, max_overlap_ratio = .7, max_overlap_total=15)

#built models
other_model = train_markov_type(subset, "Other")
brakes_model = train_markov_type(subset, "Brakes")
starter_model = train_markov_type(subset, "Starter")
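
To make the `state_size=2` setting more concrete, here is a minimal pure-Python sketch of a word-level Markov chain in which each next word depends on the previous two. It is an illustration only, not markovify's actual implementation; the names `build_chain`, `generate`, and `toy_corpus`, and the three toy sentences, are hypothetical and do not come from the form responses.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers

def build_chain(sentences, state_size=2):
    """Map each tuple of `state_size` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [START] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, state_size=2, rng=random):
    """Walk the chain from the start state until the end marker is drawn."""
    state = (START,) * state_size
    out = []
    while True:
        nxt = rng.choice(chain[state])
        if nxt == END:
            return " ".join(out)
        out.append(nxt)
        state = state[1:] + (nxt,)  # slide the two-word window forward

# Toy sentences (assumed for illustration, not taken from response.csv)
toy_corpus = [
    "my car does not start",
    "my car makes a grinding noise",
    "the car does not stop",
]
toy_chain = build_chain(toy_corpus)
print(generate(toy_chain, rng=random.Random(0)))
```

Because the state ("does", "not") was followed by both "start" and "stop" in the toy corpus, the chain can emit sentences such as "my car does not stop" that never appeared verbatim. This recombination is what lets a small set of responses yield a larger synthetic set.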

We can combine these models with relative weights

In [8]:
import numpy  # needed by choose_model() below

def generate_cases(models, weights=None):
    if weights is None:
        weights = [1] * len(models)
    
    choices = []  # List of (cumulative weight threshold, (model, category)) tuples
    
    total_weight = float(sum(weights))
    
    for i in range(len(weights)):
        choices.append((float(sum(weights[0:i+1])) / total_weight, models[i]))
    
    # Return a tuple of model and category that are randomly selected by given weights.
    def choose_model():
        r = numpy.random.uniform()
        for (model_weight, model) in choices:
            if r <= model_weight:
                return model
        return choices[-1][1]


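    while True:
        local_model = choose_model()
        # local_model[0] is the markovify model, local_model[1] is the category
        sentence = make_sentence(local_model[0])
        # markovify returns None when it cannot build a valid sentence; skip those draws
        if sentence is not None:
            yield sentence, local_model[1]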
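
The weighted selection above works by turning the weights into cumulative thresholds on the interval (0, 1] and then finding the first threshold that a uniform random number does not exceed. The following standalone sketch shows that arithmetic for the weights `[14, 7, 7]` used in this notebook; the helper names `cumulative_thresholds` and `pick` are hypothetical and exist only for this illustration.

```python
from itertools import accumulate

def cumulative_thresholds(weights):
    """Convert relative weights into cumulative probabilities ending at 1.0."""
    total = float(sum(weights))
    return [running / total for running in accumulate(weights)]

def pick(labels, weights, r):
    """Return the label whose cumulative threshold first covers r (0 <= r <= 1)."""
    for threshold, label in zip(cumulative_thresholds(weights), labels):
        if r <= threshold:
            return label
    return labels[-1]  # guard against floating-point rounding

labels = ["other", "brakes", "starter"]
weights = [14, 7, 7]

print(cumulative_thresholds(weights))  # [0.5, 0.75, 1.0]
print(pick(labels, weights, 0.6))      # brakes
```

With these weights, half of the draws land on "other" and a quarter each on "brakes" and "starter". The standard library's `random.choices(labels, weights=weights)` performs the same kind of weighted draw and could replace the hand-written loop.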
            

In [11]:
import numpy

#Generate new sentences & classify them

generated_cases = generate_cases([(other_model,'other'), (brakes_model,'brakes'), (starter_model,'starter')], [14,7,7])

# Tuples with sentence and category
sentence_tuples = [next(generated_cases) for i in range(200)]  # create 200 sentence/category tuples

# Write to csv file
with open('testdata1.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',', lineterminator='\n')
    writer.writerows(sentence_tuples)
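
As a quick sanity check on the generated file, one option is to count how many rows fall into each category and compare the proportions with the weights passed to `generate_cases`. The sketch below assumes the two-column layout written above (sentence, category). The helper `label_counts` and the four sample rows are hypothetical; an in-memory `io.StringIO` stands in for the file so the example runs on its own.

```python
import csv
import io
from collections import Counter

def label_counts(csv_file):
    """Count how often each category appears in the second column."""
    reader = csv.reader(csv_file)
    return Counter(row[1] for row in reader if len(row) >= 2)

# Hypothetical sample rows in the same sentence,category layout
sample = io.StringIO(
    "my brakes squeak,brakes\n"
    "the door is stuck,other\n"
    "the window will not close,other\n"
    "engine will not turn over,starter\n"
)
counts = label_counts(sample)
print(counts)
```

To run the same check on the real output, pass an open file handle instead, for example `with open('testdata1.csv', newline='') as f: print(label_counts(f))`.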