# Inspection Data

This notebook was loaded with:

```bash
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./dse/bin/dse pyspark --num-executors 5 --driver-memory 6g --executor-memory 6g
```

We've already explored the data in the Exploration notebook. Now we'll clean it and load it into Cassandra for later use in MLlib model-building jobs.

In [1]:
%pylab inline 

Populating the interactive namespace from numpy and matplotlib


In [2]:
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import count, datediff, lag, sum, coalesce, rank, lit, when,col, udf, to_date, year, mean, month, date_format, array
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, DateType
from pyspark.ml.feature import StringIndexer
from datetime import datetime
from pyspark.sql.window import Window
import pyspark
import sys #sys.maxsize is used below as the lower bound of the window frames

Make sure the data is in DSEFS (the DSE File System, DSE's HDFS-compatible distributed file system). The workflow should feel familiar from Hadoop. Something like this:

```bash
./dse fs
mkdir datadir
put /Users/angelo/chicago/data/Food_Inspections.csv datadir/Food_Inspections.csv

```

Now we can start working with it. Let's load it into a Spark DataFrame:

In [3]:
df = sqlContext.read.csv("dsefs:///datadir/Food_Inspections.csv", sep="|", header="True")

In [4]:
df.select("Risk").distinct().collect()

[Row(Risk=None),
 Row(Risk=u'Risk 1 (High)'),
 Row(Risk=u'All'),
 Row(Risk=u'Risk 2 (Medium)'),
 Row(Risk=u'Risk 3 (Low)')]

In [5]:
#remove the other results rows
df = df[(df.Results == 'Pass') | (df.Results == 'Fail') | (df.Results == 'Pass w/ Conditions')]

#we only want restaurants
df = df[df["Facility Type"] == "Restaurant"]

#Clean up our classes
#let's set up some classes... 0=fail, 1=pass, 2=pass with conditions
Y_col = when(col("Results") == "Fail", 0).when(col("Results") == "Pass", 1).otherwise(2)
df = df.withColumn("y", Y_col)

We'll encode our categorical pass/fail result in two ways: `y` answers "did you pass?" (with a third class for passing with conditions) and `y_fail` answers "did you fail?". One is easier for a human to read (not so many double negatives); the other is the one we'll process.

In [6]:
Y_fail = when(col("Results") == "Fail", 1).otherwise(0)
df = df.withColumn("y_fail", Y_fail)
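
As a sanity check on the two encodings, here is a minimal plain-Python sketch of the same mapping. The helper names `encode_y` and `encode_y_fail` are illustrative and not part of the notebook; they only mirror the `when(...).otherwise(...)` chains above.

```python
def encode_y(result):
    """Three-class label: 0 = fail, 1 = pass, 2 = anything else (pass with conditions)."""
    if result == "Fail":
        return 0
    if result == "Pass":
        return 1
    return 2


def encode_y_fail(result):
    """Binary label: 1 if the inspection failed, otherwise 0."""
    return 1 if result == "Fail" else 0


for result in ["Pass", "Fail", "Pass w/ Conditions"]:
    print(result, encode_y(result), encode_y_fail(result))
```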

We don't need city and state (these are only Chicago's inspections), and we don't need the business names, etc.

In [7]:
#drop columns we don't care about
df = df.drop('Location') #just a dupe of Latitude/longitude
df = df.drop('State') #they're all chicago
df = df.drop('City')
df = df.drop('Inspection ID') #since we're going to aggregate anyway
df = df.drop('DBA Name') #this is in the licenses anyway
df = df.drop('AKA Name') #this is in the licenses anyway
df = df.drop('Address') #this is in the licenses anyway
df = df.drop('Facility Type') #this is in the licenses anyway, and is hand entered here... so very bad data

df.head(1)

[Row(License #=u'2093906', Risk=u'Risk 2 (Medium)', Zip=u'60615', Inspection Date=u'11/3/16', Inspection Type=u'Canvass', Results=u'Pass', Latitude=u'41.79517545', Longitude=u'-87.59660148', y=1, y_fail=0)]

In [8]:
df.select("License #").distinct().count()
#this is bigger than unique businesses, so they have multiple licenses... too bad I don't know the type...

15169

Those 78k inspections cover about 15k distinct licenses.

In [9]:
df.count()

78897

## Recoding the Inspection Type

In [10]:
df.select("Inspection Type").distinct().count()

64

There are 64 inspection types, but as we noticed in exploration, many are duplicates that differ only in spelling or capitalization.

Let's clean up these inspection types as we did before:

In [11]:
df = df.replace(
    ['finish complaint inspection from 5-18-10','CANVASS/SPECIAL EVENT', 'CANVASS FOR RIB FEST', 'CANVASS SPECIAL EVENTS','LIQUOR CATERING', 'Task Force Liquor Catering','SPECIAL TASK FORCE', 'TASKFORCE', 'task force','Complaint-Fire', 'FIRE/COMPLAIN', 'fire complaint','Task force liquor inspection 1474', 'Task Force for liquor 1474','Package Liquor 1474', 'Task Force for liquor 1474', 'TASK FORCE LIQUOR 1474', 'TASK FORCE LIQUOR 1474','TAVERN 1470', 'task force(1470) liquor tavern','Task Force Liquor 1475'],
    ['Complaint','CANVASS SPECIAL EVENTS','CANVASS SPECIAL EVENTS','CANVASS SPECIAL EVENTS','Task Force Liquor Catering','Task Force Liquor Catering','Special Task Force','Special Task Force','Special Task Force','Fire Complaint','Fire Complaint','Fire Complaint','TASK FORCE PACKAGE LIQUOR','TASK FORCE PACKAGE LIQUOR','TASK FORCE PACKAGE LIQUOR','TASK FORCE PACKAGE LIQUOR','TASK FORCE PACKAGE LIQUOR','TASK FORCE PACKAGE LIQUOR','Task Force Liquor Tavern','Task Force Liquor Tavern','Task Force Liquor Tavern'],
    "Inspection Type")

df = df.replace(
    ['license','CLOSE-UP/COMPLAINT REINSPECTION', 'REINSPECTION OF CLOSE-UP','1315 license reinspection', 'License Re-Inspection','TASK FORCE LIQUOR 1470', 'Task Force 1470 Liquor Tavern', 'TASK FORCE LIQUOR 1470','CANVASS RE INSPECTION OF CLOSE UP','LICENSE TASK FORCE / NOT -FOR-PROFIT CLUB', 'LICENSE TASK FORCE / NOT -FOR-PROFIT CLU','TASK FORCE LIQUOR (1481)','license task 1474','TASK FORCE PACKAGE GOODS 1474'],
    ['License','REINSPECTION OF CLOSE-UP','REINSPECTION OF CLOSE-UP','License Reinspection','License Reinspection','Task Force Liquor Tavern','Task Force Liquor Tavern','Task Force Liquor Tavern','REINSPECTION OF CLOSE-UP','Task Force Not-For-Profit Club','Task Force Not-For-Profit Club','TASK FORCE PACKAGE LIQUOR','License-Task Force','License-Task Force'],
    "Inspection Type")

df = df.replace(
    ['KIDS CAFE','CANVAS','LICENSE','No entry', 'NO ENTRY', 'no entry','LICENSE CONSULTATION','LICENSE RENEWAL INSPECTION FOR DAYCARE', 'LICENSE RENEWAL FOR DAYCARE',  'LICENSE DAYCARE 1586','TWO PEOPLE ATE AND GOT SICK.', 'Suspected Food Poisoning', 'SFP/Complaint', 'SFP', 'sfp/complaint', 'SFP/COMPLAINT','Suspected Food Poisoning Re-inspection', 'SFP RECENTLY INSPECTED','out ofbusiness', 'OUT OF BUSINESS'],
    ['Kids Cafe',"Canvass",'License','No Entry', 'No Entry', 'No Entry', 'License consultation','DAY CARE LICENSE RENEWAL','DAY CARE LICENSE RENEWAL','DAY CARE LICENSE RENEWAL','Suspected Food Poisoning','Suspected Food Poisoning','Suspected Food Poisoning','Suspected Food Poisoning','Suspected Food Poisoning','Suspected Food Poisoning','Suspected Food Poisoning Reinspection','Suspected Food Poisoning Reinspection','Out of Business','Out of Business'],
    "Inspection Type")

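Keeping two long parallel lists aligned by hand is error-prone. One option, sketched here in plain Python with a small subset of the pairs above, is to keep the recodes in a dict and derive both lists from it. The names `RECODES` and `recode` are hypothetical; the derived lists would be passed to `df.replace` in the same way as in the cells above.

```python
# A few of the variant -> canonical pairs from the cells above.
RECODES = {
    "CANVAS": "Canvass",
    "SFP": "Suspected Food Poisoning",
    "SFP/COMPLAINT": "Suspected Food Poisoning",
    "NO ENTRY": "No Entry",
    "no entry": "No Entry",
    "out ofbusiness": "Out of Business",
}

# Derive the two parallel lists from one source so they cannot drift apart.
to_replace = list(RECODES.keys())
values = [RECODES[key] for key in to_replace]


def recode(inspection_type):
    """Return the canonical label, or the input unchanged if no recode applies."""
    return RECODES.get(inspection_type, inspection_type)


print(recode("CANVAS"), "|", recode("Canvass"))
```
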
Now, we'll extract some features from these inspection types by one-hot encoding some of them

In [12]:
df.head(5)

[Row(License #=u'2093906', Risk=u'Risk 2 (Medium)', Zip=u'60615', Inspection Date=u'11/3/16', Inspection Type=u'Canvass', Results=u'Pass', Latitude=u'41.79517545', Longitude=u'-87.59660148', y=1, y_fail=0),
 Row(License #=u'2476569', Risk=u'Risk 1 (High)', Zip=u'60621', Inspection Date=u'11/2/16', Inspection Type=u'License Reinspection', Results=u'Pass', Latitude=u'41.77985559', Longitude=u'-87.64514243', y=1, y_fail=0),
 Row(License #=u'2476568', Risk=u'Risk 1 (High)', Zip=u'60621', Inspection Date=u'11/2/16', Inspection Type=u'License Reinspection', Results=u'Pass', Latitude=u'41.77985559', Longitude=u'-87.64514243', y=1, y_fail=0),
 Row(License #=u'2354431', Risk=u'Risk 1 (High)', Zip=u'60615', Inspection Date=u'11/2/16', Inspection Type=u'Canvass', Results=u'Pass w/ Conditions', Latitude=u'41.80190189', Longitude=u'-87.62192629', y=2, y_fail=0),
 Row(License #=u'1767714', Risk=u'Risk 1 (High)', Zip=u'60622', Inspection Date=u'11/2/16', Inspection Type=u'Complaint', Results=u'Fail',

We need to cache the data often here. Don't believe me? Comment out the `cache()` calls and give it a shot!

In [13]:
df.cache()

DataFrame[License #: string, Risk: string, Zip: string, Inspection Date: string, Inspection Type: string, Results: string, Latitude: string, Longitude: string, y: int, y_fail: int]

We'll also categorize them a bit better. We'll make some features by grouping some of the types of inspections. For example, does it matter if this was any type of reinspection? What about a canvassing, are those effective?

In [14]:
x_col = when((col("Inspection Type") == 'Complaint Re-Inspection') |
             (col("Inspection Type") == 'Canvass Re-Inspection') |
             (col("Inspection Type") == 'Complaint-Fire Re-inspection') |
             (col("Inspection Type") == 'RECALL INSPECTION') |
             (col("Inspection Type") == 'REINSPECTION OF 48 HOUR NOTICE') |
             (col("Inspection Type") == 'REINSPECTION') |
             (col("Inspection Type") == 'Suspected Food Poisoning Reinspection') \
             , 1).otherwise(0)

df2 = df.withColumn("reinspection", x_col)

x_col = when((col("Inspection Type") == 'Recent Inspection') \
             , 1).otherwise(0)

df2 = df2.withColumn('recent_inspection', x_col)

x_col = when((col("Inspection Type") == 'License-Task Force') |
             (col("Inspection Type") == 'Task Force Liquor Tavern') |
             (col("Inspection Type") == 'Task Force Liquor Catering') |
             (col("Inspection Type") == 'TASK FORCE NIGHT') |
             (col("Inspection Type") == 'Special Task Force') |
             (col("Inspection Type") == 'TASK FORCE PACKAGE LIQUOR') |
             (col("Inspection Type") == 'Task Force Not-For-Profit Club') \
             , 1).otherwise(0)

df2 = df2.withColumn('task_force', x_col)

x_col = when((col("Inspection Type") == 'Special Events (Festivals)') |
             (col("Inspection Type") == 'CANVASS SPECIAL EVENTS') |
             (col("Inspection Type") == 'Summer Feeding') |
             (col("Inspection Type") == 'TASTE OF CHICAGO') \
             , 1).otherwise(0)

df2 = df2.withColumn('special_event', x_col)

x_col = when((col("Inspection Type") == 'Canvass') |
             (col("Inspection Type") == 'Canvass Re-Inspection') |
             (col("Inspection Type") == 'CANVASS SPECIAL EVENTS') \
             , 1).otherwise(0)

df2 = df2.withColumn('canvass', x_col)

x_col = when((col("Inspection Type") == 'Task Force Liquor Tavern') |
             (col("Inspection Type") == 'TASK FORCE PACKAGE LIQUOR') |
             (col("Inspection Type") == 'Task Force Liquor Catering') \
             , 1).otherwise(0)

df2 = df2.withColumn('liquor', x_col)
   
             
x_col = when((col("Inspection Type") == 'Complaint-Fire Re-inspection') |
             (col("Inspection Type") == 'Fire Complaint') |
             (col("Inspection Type") == 'Short Form Fire-Complaint')  \
             , 1).otherwise(0)

df2 = df2.withColumn('fire', x_col)
             
x_col = when((col("Inspection Type") == 'Complaint') |
             (col("Inspection Type") == 'Complaint Re-Inspection') |
             (col("Inspection Type") == 'Short Form Complaint') |
             (col("Inspection Type") == 'Complaint-Fire Re-inspection') |
             (col("Inspection Type") == 'Short Form Fire-Complaint') |
             (col("Inspection Type") == 'SMOKING COMPLAINT') |
             (col("Inspection Type") == 'NO ENTRY-SHORT COMPLAINT)') |
             (col("Inspection Type") == 'Fire Complaint') \
             , 1).otherwise(0)

df2 = df2.withColumn('complaint', x_col)

                          
x_col = when((col("Inspection Type") == 'License') |
             (col("Inspection Type") == 'License Reinspection') |
             (col("Inspection Type") == 'OWNER SUSPENDED OPERATION/LICENSE') |
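             (col("Inspection Type") == 'License Consultation') |
             (col("Inspection Type") == 'License consultation') | #spelling produced by the replace() step
             (col("Inspection Type") == 'Pre-License Consultation') |
             (col("Inspection Type") == 'DAYCARE LICENSE RENEWAL') |
             (col("Inspection Type") == 'DAY CARE LICENSE RENEWAL') | #spelling produced by the replace() step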
             (col("Inspection Type") == 'LICENSE/NOT READY') |
             (col("Inspection Type") == 'LICENSE WRONG ADDRESS') |
             (col("Inspection Type") == 'LICENSE REQUEST') |
             (col("Inspection Type") == 'License-Task Force') \
             , 1).otherwise(0)

df2 = df2.withColumn('license_related', x_col)
             
#df2.head(5)
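
Each flag above is a membership test against a fixed group of inspection types, and one type may belong to several groups. A minimal plain-Python sketch of that logic for three of the groups follows; the `GROUPS` dict and the `flags` helper are illustrative names, not part of the notebook.

```python
# Three of the groups from the cell above, written as sets.
GROUPS = {
    "canvass": {"Canvass", "Canvass Re-Inspection", "CANVASS SPECIAL EVENTS"},
    "fire": {"Complaint-Fire Re-inspection", "Fire Complaint",
             "Short Form Fire-Complaint"},
    "liquor": {"Task Force Liquor Tavern", "TASK FORCE PACKAGE LIQUOR",
               "Task Force Liquor Catering"},
}


def flags(inspection_type):
    """Return a 0/1 indicator per group for one inspection type."""
    return {name: int(inspection_type in members)
            for name, members in GROUPS.items()}


print(flags("Canvass"))
print(flags("Fire Complaint"))
```

In Spark, the same membership test could likely be expressed with `Column.isin`, which would avoid the long chains of `|`.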

## Removing empty licenses and duplicate inspections

In [15]:
df2 = df2.filter((col("License #") != 0) & (col("Inspection Type") != 'Duplicated'))

Let's clean up the set by renaming the columns and selecting only what we want.

In [16]:
df2 = df2.select(col("License #").alias("license_id"), col("Risk").alias("risk_description"), \
                 col("Zip").alias("zip"), col("Inspection Date").alias("inspection_date_string"), \
                 col("Results").alias("y_description"), col("Latitude").alias("latitude"), \
                 col("Longitude").alias("longitude"), col("y"), col("y_fail"), col("reinspection"),\
                 col("recent_inspection"), col("task_force"), col("special_event"), col("canvass"), \
                 col("liquor"), col("fire"), col("complaint"), col("license_related"), \
                 col("Inspection Type").alias("inspection_type_description"))

In [17]:
df2.cache()

DataFrame[license_id: string, risk_description: string, zip: string, inspection_date_string: string, y_description: string, latitude: string, longitude: string, y: int, y_fail: int, reinspection: int, recent_inspection: int, task_force: int, special_event: int, canvass: int, liquor: int, fire: int, complaint: int, license_related: int, inspection_type_description: string]

## Encoding Risk

In [18]:
#encode risk as an integer: 1 = high, 2 = medium, 3 = everything else
x_col = when(col("risk_description") == 'Risk 1 (High)', 1) \
        .when(col("risk_description") == 'Risk 2 (Medium)', 2) \
        .otherwise(3)

df2 = df2.withColumn("risk", x_col)
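
Because the chain ends in `.otherwise(3)`, every value that is not high or medium lands in class 3, including the `All` and missing values seen earlier in the `Risk` column. A plain-Python mirror of the chain makes that visible; `encode_risk` is a hypothetical helper used only for this check.

```python
def encode_risk(risk_description):
    """Mirror of the when/otherwise chain: 1 = high, 2 = medium, 3 = everything else."""
    if risk_description == "Risk 1 (High)":
        return 1
    if risk_description == "Risk 2 (Medium)":
        return 2
    return 3


# The distinct values seen earlier in the Risk column.
for value in ["Risk 1 (High)", "Risk 2 (Medium)", "Risk 3 (Low)", "All", None]:
    print(repr(value), encode_risk(value))
```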

## Encode Inspection Type

As mentioned earlier, we have 64 types of inspections. Let's see how many have 30 or fewer inspections (i.e. how many types of inspections are rare):

In [19]:
df2.groupby(col("inspection_type_description")).count().filter(col("count") <= 30).collect()

[Row(inspection_type_description=u'Pre-License Consultation', count=6),
 Row(inspection_type_description=u'CHANGED COURT DATE', count=1),
 Row(inspection_type_description=u'TASK FORCE PACKAGE LIQUOR', count=7),
 Row(inspection_type_description=u'TASK FORCE NOT READY', count=1),
 Row(inspection_type_description=u'LICENSE/NOT READY', count=1),
 Row(inspection_type_description=u'Special Events (Festivals)', count=21),
 Row(inspection_type_description=u'citation re-issued', count=1),
 Row(inspection_type_description=u'LICENSE REQUEST', count=1),
 Row(inspection_type_description=u'Not Ready', count=3),
 Row(inspection_type_description=u'LIQOUR TASK FORCE NOT READY', count=1),
 Row(inspection_type_description=u'NO ENTRY-SHORT COMPLAINT)', count=1),
 Row(inspection_type_description=u'CANVASS SPECIAL EVENTS', count=2),
 Row(inspection_type_description=u'Task Force Liquor Catering', count=2),
 Row(inspection_type_description=u'Sample Collection', count=1),
 Row(inspection_type_description=u'REI

Many of those types are used just once or twice. We can imagine that once or twice in 7 years of data isn't going to help our predictive power, so we'll drop them.

In [20]:
x_col = when((col("inspection_type_description") == 'Pre-License Consultation') |
             (col("inspection_type_description") == 'CHANGED COURT DATE') |
             (col("inspection_type_description") == 'TASK FORCE NOT READY') |
             (col("inspection_type_description") == 'LICENSE/NOT READY') |
             (col("inspection_type_description") == 'Special Events (Festivals)') |
             (col("inspection_type_description") == 'citation re-issued') |
             (col("inspection_type_description") == 'LICENSE REQUEST') |
             (col("inspection_type_description") == 'Not Ready') |
             (col("inspection_type_description") == 'LIQOUR TASK FORCE NOT READY') |
             (col("inspection_type_description") == 'NO ENTRY-SHORT COMPLAINT)') |
             (col("inspection_type_description") == 'CANVASS SPECIAL EVENTS') |
             (col("inspection_type_description") == 'Task Force Liquor Catering') |
             (col("inspection_type_description") == 'Sample Collection') |
             (col("inspection_type_description") == 'REINSPECTION OF CLOSE-UP') |
             (col("inspection_type_description") == 'error save') |
             (col("inspection_type_description") == 'Non-Inspection') |                          
             (col("inspection_type_description") == 'Special Task Force') |
             (col("inspection_type_description") == 'TASTE OF CHICAGO') |
             (col("inspection_type_description") == 'POSSIBLE FBI') |
             (col("inspection_type_description") == 'RE-INSPECTION OF CLOSE-UP') |
             (col("inspection_type_description") == 'HACCP QUESTIONAIRE') |
             (col("inspection_type_description") == 'expansion') |
             (col("inspection_type_description") == 'Task Force Not-For-Profit Club') |
             (col("inspection_type_description") == 'CORRECTIVE ACTION') |
             (col("inspection_type_description") == 'REINSPECTION') \
             , 1).otherwise(0)

df2 = df2.filter(x_col != 1)

In [24]:
#inspections_to_drop = df2.groupby(col("inspection_type_description")).count().filter(col("count") <= 30) \
#    .select("inspection_type_description").collect()

In [31]:
#inspections_to_drop = [str(i.inspection_type_description) for i in inspections_to_drop]

In [67]:
#df2.where(df2["inspection_type_description"] == array(*[x for x in inspections_to_drop]))
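
The commented-out cells above hint at building the drop list from the counts instead of typing it out. A plain-Python sketch of that idea follows. The sample counts are made up for illustration (only the threshold of 30 comes from the notebook), and `RARE_THRESHOLD` is a hypothetical name.

```python
from collections import Counter

RARE_THRESHOLD = 30  # same cutoff as the groupby/filter cell above

# Illustrative sample only; the real counts come from the DataFrame.
inspection_types = (["Canvass"] * 40 + ["Complaint"] * 35
                    + ["Not Ready"] * 3 + ["expansion"] * 1)

counts = Counter(inspection_types)

# Types seen RARE_THRESHOLD times or fewer are treated as rare.
types_to_drop = sorted(t for t, n in counts.items() if n <= RARE_THRESHOLD)
drop_set = set(types_to_drop)

# Keep only the rows whose type is not rare.
kept = [t for t in inspection_types if t not in drop_set]

print(types_to_drop)
print(Counter(kept))
```

On the Spark side, a collected list like `types_to_drop` could probably be applied with `Column.isin`, for example `df2.filter(~col("inspection_type_description").isin(types_to_drop))`; that line is untested here.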

In [21]:
df2.select("inspection_type_description").distinct().count()

20

In [22]:
df2.count()

78769

Let's use StringIndexer to create a categorical feature from Inspection Type.

In [23]:
indexer = StringIndexer(inputCol="inspection_type_description", outputCol="inspection_type").fit(df2)
df2 = indexer.transform(df2)

## Encode Inspection Date

In [26]:
string2Date = udf (lambda s: datetime.strptime(s, '%m/%d/%y'), DateType())
df2 = df2.withColumn("inspection_dt", string2Date(df2["inspection_date_string"]))
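
The format string matters here: `%y` parses a two-digit year, so `11/3/16` becomes 2016. Below is a small standalone check using the same format as the UDF. Calling `.date()` to return a plain `datetime.date` is an optional tidy-up rather than something the notebook does, and `parse_inspection_date` is an illustrative name.

```python
from datetime import date, datetime


def parse_inspection_date(s):
    """Parse a month/day/two-digit-year string into a datetime.date."""
    return datetime.strptime(s, "%m/%d/%y").date()


print(parse_inspection_date("11/3/16"))
print(parse_inspection_date("8/27/10"))
```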

In [27]:
#Quick sanity check
df2.select(year("inspection_dt")).distinct().toPandas()

Unnamed: 0,year(inspection_dt)
0,2015
1,2013
2,2014
3,2012
4,2016
5,2010
6,2011


In [28]:
df2 = df2.persist(pyspark.StorageLevel.MEMORY_AND_DISK)

### Week and Month

In [29]:
df2 = df2.withColumn("weekday_description", date_format(col("inspection_dt"), "E"))
df2 = df2.withColumn("month", month(col("inspection_dt")))

In [30]:
indexer = StringIndexer(inputCol="weekday_description", outputCol="weekday").fit(df2)
df2 = indexer.transform(df2)

#note: month is already numerically indexed

## Encoding "Did you pass last time?"

In [32]:
df_test = df2.withColumn('prev_fail', 
                        lag(df2['y_fail'], 1, 0) #1 row back; returns 0 if there is no previous inspection for the license
                         .over(Window.partitionBy(df2["license_id"])
                             .orderBy(df2["inspection_dt"])))
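
To make the window semantics concrete, here is a plain-Python sketch of what a one-row `lag` with a default of 0 computes over a per-license, date-ordered window. The sample rows and the helper `add_prev_fail` are invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Made-up rows, deliberately out of date order.
rows = [
    {"license_id": "A", "inspection_dt": date(2015, 1, 10), "y_fail": 1},
    {"license_id": "A", "inspection_dt": date(2014, 6, 1), "y_fail": 0},
    {"license_id": "B", "inspection_dt": date(2016, 3, 5), "y_fail": 1},
    {"license_id": "A", "inspection_dt": date(2016, 2, 2), "y_fail": 0},
]


def add_prev_fail(rows):
    """Attach the previous inspection's y_fail within each license (0 if none)."""
    by_license = defaultdict(list)
    for row in rows:
        by_license[row["license_id"]].append(row)  # partitionBy license_id

    out = []
    for license_id in sorted(by_license):
        ordered = sorted(by_license[license_id],
                         key=lambda r: r["inspection_dt"])  # orderBy inspection_dt
        prev = 0  # default when there is no earlier inspection
        for row in ordered:
            out.append(dict(row, prev_fail=prev))
            prev = row["y_fail"]
    return out


result = add_prev_fail(rows)
for row in result:
    print(row["license_id"], row["inspection_dt"], row["y_fail"], row["prev_fail"])
```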

In [33]:
#df_test = df_test.withColumn("prev_fail", coalesce(df_test.prev_fail, lit(0)))

In [34]:
df_test.columns

['license_id',
 'risk_description',
 'zip',
 'inspection_date_string',
 'y_description',
 'latitude',
 'longitude',
 'y',
 'y_fail',
 'reinspection',
 'recent_inspection',
 'task_force',
 'special_event',
 'canvass',
 'liquor',
 'fire',
 'complaint',
 'license_related',
 'inspection_type_description',
 'risk',
 'inspection_type',
 'inspection_dt',
 'weekday_description',
 'month',
 'weekday',
 'prev_fail']

## Cumulative Failures

In [35]:
df_test = df_test.withColumn('cumulative_failures', 
                        sum(df_test['y_fail'])
                         .over(Window.partitionBy(df_test["license_id"])
                             .orderBy(df_test["inspection_dt"])
                              .rowsBetween(-sys.maxsize,0))) #every row from the first inspection through the current one (inclusive)

In [36]:
df_test.select("cumulative_failures").distinct().collect()

[Row(cumulative_failures=0),
 Row(cumulative_failures=7),
 Row(cumulative_failures=6),
 Row(cumulative_failures=9),
 Row(cumulative_failures=5),
 Row(cumulative_failures=1),
 Row(cumulative_failures=10),
 Row(cumulative_failures=3),
 Row(cumulative_failures=12),
 Row(cumulative_failures=8),
 Row(cumulative_failures=11),
 Row(cumulative_failures=2),
 Row(cumulative_failures=4),
 Row(cumulative_failures=13),
 Row(cumulative_failures=14),
 Row(cumulative_failures=15),
 Row(cumulative_failures=16)]

## Encoding "Have you ever failed?"

In [37]:
df_test = df_test.withColumn("ever_failed", when(col("cumulative_failures") != 0, 1).otherwise(0))

## Encoding Cumulative Inspections

In [38]:
df_test = df_test.withColumn('cumulative_inspections', 
                        count(df_test["inspection_dt"]).over(Window.partitionBy(df_test["license_id"])
                              .orderBy(df_test["inspection_dt"])
                              .rowsBetween(-sys.maxsize,0))) #every row from the first inspection through the current one (inclusive)

## Encoding Proportion of Fails to Inspections

In [39]:
df_test = df_test.withColumn("proportion_past_failures", col("cumulative_failures")/col("cumulative_inspections"))
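
The history features above (cumulative failures, ever failed, cumulative inspections, and the failure proportion) are running aggregates over each license's inspections in date order, with the current row included in the frame. Here is a plain-Python sketch for a single license, using made-up outcomes; `history_features` is a hypothetical helper.

```python
from itertools import accumulate


def history_features(y_fail_in_date_order):
    """Running features for one license; each row's frame includes the row itself."""
    cumulative_failures = list(accumulate(y_fail_in_date_order))
    features = []
    for n, failures in enumerate(cumulative_failures, start=1):
        features.append({
            "cumulative_failures": failures,
            "ever_failed": 1 if failures != 0 else 0,
            "cumulative_inspections": n,
            "proportion_past_failures": failures / float(n),
        })
    return features


features = history_features([0, 1, 0, 1])
for f in features:
    print(f)
```

Because the frame includes the current row, each feature already reflects the outcome of the inspection it sits on, which matches the first row shown by `df_test.head()`: one inspection, zero failures, proportion 0.0.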

In [40]:
df_test.head()

Row(license_id=u'1042702', risk_description=u'Risk 1 (High)', zip=u'60614', inspection_date_string=u'8/27/10', y_description=u'Pass', latitude=u'41.93041201', longitude=u'-87.64388653', y=1, y_fail=0, reinspection=0, recent_inspection=0, task_force=0, special_event=0, canvass=1, liquor=0, fire=0, complaint=0, license_related=0, inspection_type_description=u'Canvass', risk=1, inspection_type=0.0, inspection_dt=datetime.date(2010, 8, 27), weekday_description=u'Fri', month=8, weekday=2.0, prev_fail=0, cumulative_failures=0, ever_failed=0, cumulative_inspections=1, proportion_past_failures=0.0)

## Encoding Number of Days Since Last Inspection

In [41]:
df_lag = df_test.withColumn('last_inspection_dt',                         
                        lag(df_test['inspection_dt'], count=1) 
                        .over(Window.partitionBy(df_test["license_id"])
                        .orderBy(df_test["inspection_dt"])))

df_test = df_lag.withColumn('days_since_last_inspection', 
                        datediff(df_lag.inspection_dt, df_lag.last_inspection_dt)) #datediff(end, start): current minus previous, so the value is non-negative
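
Spark's `datediff(end, start)` returns `end` minus `start` in days, so the argument order decides the sign. The intended quantity is the number of days from the previous inspection to the current one, with no value for a license's first inspection. A plain-Python sketch of that quantity, with made-up dates and a hypothetical helper `days_since_last`:

```python
from datetime import date


def days_since_last(inspection_dates_in_order):
    """Days elapsed since the previous inspection; None for the first one."""
    result = []
    previous = None
    for current in inspection_dates_in_order:
        if previous is None:
            result.append(None)  # no earlier inspection, like a null lag
        else:
            result.append((current - previous).days)  # current minus previous
        previous = current
    return result


dates = [date(2010, 8, 27), date(2011, 8, 27), date(2011, 9, 3)]
gaps = days_since_last(dates)
print(gaps)
```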

In [42]:
df_test = df_test.drop('last_inspection_dt')

## Save our Results!

In [43]:
df_test.dtypes

[('license_id', 'string'),
 ('risk_description', 'string'),
 ('zip', 'string'),
 ('inspection_date_string', 'string'),
 ('y_description', 'string'),
 ('latitude', 'string'),
 ('longitude', 'string'),
 ('y', 'int'),
 ('y_fail', 'int'),
 ('reinspection', 'int'),
 ('recent_inspection', 'int'),
 ('task_force', 'int'),
 ('special_event', 'int'),
 ('canvass', 'int'),
 ('liquor', 'int'),
 ('fire', 'int'),
 ('complaint', 'int'),
 ('license_related', 'int'),
 ('inspection_type_description', 'string'),
 ('risk', 'int'),
 ('inspection_type', 'double'),
 ('inspection_dt', 'date'),
 ('weekday_description', 'string'),
 ('month', 'int'),
 ('weekday', 'double'),
 ('prev_fail', 'int'),
 ('cumulative_failures', 'bigint'),
 ('ever_failed', 'int'),
 ('cumulative_inspections', 'bigint'),
 ('proportion_past_failures', 'double'),
 ('days_since_last_inspection', 'int')]

```cql
CREATE KEYSPACE chicago_data 
   WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1};
```

```cql
CREATE TABLE chicago_data.inspections (
    license_id text,
    risk_description text,
    zip text,
    inspection_date_string text,
    inspection_type_description text,
    y_description text,
    latitude text,
    longitude text,
    y int,
    y_fail int,
    reinspection int,
    recent_inspection int,
    task_force int,
    special_event int,
    canvass int,
    fire int,
    liquor int,
    complaint int,
    license_related int,
    inspection_type int,
    risk int,
    inspection_dt date,
    prev_fail int,
    cumulative_failures int,
    weekday_description text,
    month int,
    weekday int,
    ever_failed int,
    cumulative_inspections int,
    proportion_past_failures double,
    days_since_last_inspection int,
    PRIMARY KEY (license_id, inspection_dt))
WITH CLUSTERING ORDER BY (inspection_dt DESC);
```
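
Before writing, it can help to compare the DataFrame's types with the table definition. The sketch below does that in plain Python for a subset of the columns, copying the types from the `dtypes` output and the `CREATE TABLE` statement above. The `SPARK_TO_CQL` mapping is an assumption about which type names correspond one-to-one; whether the connector coerces the remaining columns on write is not checked here.

```python
# Assumed one-to-one correspondence between Spark and CQL type names.
SPARK_TO_CQL = {"string": "text", "int": "int", "bigint": "bigint",
                "double": "double", "date": "date"}

# Subset of df_test.dtypes, copied from the output above.
spark_dtypes = {
    "license_id": "string",
    "y_fail": "int",
    "risk": "int",
    "inspection_type": "double",
    "inspection_dt": "date",
    "weekday": "double",
    "cumulative_failures": "bigint",
    "cumulative_inspections": "bigint",
    "proportion_past_failures": "double",
    "days_since_last_inspection": "int",
}

# The same columns as declared in the CREATE TABLE statement above.
cql_types = {
    "license_id": "text",
    "y_fail": "int",
    "risk": "int",
    "inspection_type": "int",
    "inspection_dt": "date",
    "weekday": "int",
    "cumulative_failures": "int",
    "cumulative_inspections": "int",
    "proportion_past_failures": "double",
    "days_since_last_inspection": "int",
}

# Columns whose Spark type does not map directly to the declared CQL type.
mismatches = sorted(name for name, spark_type in spark_dtypes.items()
                    if SPARK_TO_CQL[spark_type] != cql_types[name])

for name in mismatches:
    print(name, spark_dtypes[name], "->", cql_types[name])
```

If an exact match is preferred, the reported columns could be cast in Spark before the write, or declared with the wider type in the table.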

In [44]:
df_test.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="inspections", keyspace="chicago_data")\
    .save()