# Introduction

## Machine Learning & Spark

Most machine learning problems are framed as either regression or classification. A regression model learns to predict a number. A classification model predicts a discrete or categorical value.

## Connecting to Spark

With the pyspark module loaded, you can connect to Spark. First you need to tell Spark where the cluster is located. You can connect to a remote cluster, in which case you specify a Spark URL that gives the network location of the cluster's master node; the URL must include a port number. Alternatively, you can create a local cluster, where everything happens on a single computer. For a local cluster you need only specify "local" and, optionally, the number of cores to use: local[4] uses four cores and local[*] uses all available cores. By default, a local cluster runs on a single core.

You connect to Spark by creating a SparkSession object. Once the session has been created, you can interact with Spark. Stop the SparkSession (spark.stop()) when you are done.
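
The connection options above can be sketched as a few small helpers. This is only an illustration: the function names local_master, remote_master and spark_session are hypothetical, and the default port 7077 is an assumption (it is the usual default for a standalone master, but your cluster may use another).

```python
from contextlib import contextmanager


def local_master(cores=None):
    """Master string for a local cluster: "local", "local[4]" or "local[*]"."""
    return "local" if cores is None else f"local[{cores}]"


def remote_master(host, port=7077):
    """Spark URL of a remote master node; the URL always carries a port.

    The default of 7077 is an assumption, not something the notes specify.
    """
    return f"spark://{host}:{port}"


@contextmanager
def spark_session(master, app_name="test"):
    """Yield a SparkSession and stop it afterwards, even if an error occurs."""
    # Imported here so the two helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master(master).appName(app_name).getOrCreate()
    try:
        yield spark
    finally:
        spark.stop()
```

With pyspark installed, `with spark_session(local_master("*")) as spark: print(spark.version)` would run on all local cores and stop the session on exit.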

### Creating a SparkSession

In [6]:
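from pyspark.sql import SparkSession

# Create (or reuse) a session on a local cluster that uses all available cores.
spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
print(spark.version)
spark.stop()  # release the cluster resources when done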

3.1.2


## Loading Data

Spark represents tabular data using the DataFrame class. The data are captured as rows ("records"), each of which is broken down into one or more columns ("fields"). Every column has a name and a specific data type.

The .csv method of spark.read treats all columns as strings by default. There are two ways to get the correct column types: infer them from the data (inferSchema=True) or specify them manually. Inferring the schema requires an extra pass over the data, so it increases load time noticeably if the file is big. Inference can also go wrong: if missing values in an otherwise integer column are written as NA, Spark infers the column as a string rather than an integer. You can tell Spark which placeholder denotes a missing value with the nullValue argument.
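
To see the effect of nullValue on type inference, a small experiment along these lines may help. It is a sketch under assumptions: the sample file, its two columns and the helper names write_sample_csv and inferred_types are made up for illustration, and inferred_types expects an already created SparkSession.

```python
import csv
import os
import tempfile


def write_sample_csv(directory):
    """Write a tiny CSV whose 'delay' column uses NA for a missing value."""
    path = os.path.join(directory, "sample_flights.csv")
    rows = [["carrier", "delay"], ["US", "NA"], ["UA", "30"], ["AA", "-5"]]
    with open(path, "w", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerows(rows)
    return path


def inferred_types(spark, path):
    """Return the inferred dtypes without and with nullValue="NA"."""
    plain = spark.read.csv(path, header=True, inferSchema=True)
    with_null = spark.read.csv(path, header=True, inferSchema=True, nullValue="NA")
    return dict(plain.dtypes), dict(with_null.dtypes)
```

Given a live session, calling inferred_types(spark, write_sample_csv(tempfile.mkdtemp())) should report delay as a string in the first dictionary and as an int in the second.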

You can use StructType and StructField to specify the type of each column in an explicit schema.

### Loading flights data

In [12]:
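import findspark
findspark.init()  # locate the Spark installation and make pyspark importable

from pyspark import SparkContext
from pyspark.sql import SparkSession

# Start a local SparkContext; the session created below reuses it.
sc = SparkContext("local", "pyspark-shell")
spark = SparkSession.builder.getOrCreate()

# Infer the column types and treat the string "NA" as a missing value.
flights = spark.read.csv("flights.csv", sep=",", header=True, inferSchema=True, nullValue="NA")
print("The data contain %d records" % flights.count())
flights.show(5)
print(flights.dtypes)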

The data contain 50000 records
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
|  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows

[('mon', 'int'), ('dom', 'int'), ('dow', 'int'), ('carrier', 'string'), ('flight', 'int'), ('org', 'string'), ('mile', 'int'), ('depart', 'double'), ('duration', 'int'), ('delay', 'int')]
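
Because inferSchema costs an extra pass over the file, the types printed above could instead be supplied up front. As an alternative to StructType, spark.read.csv also accepts the schema as a DDL-formatted string. The sketch below builds that string from the dtypes output; the names FLIGHT_COLUMNS, ddl_schema and load_flights are hypothetical, and load_flights expects an existing SparkSession.

```python
# Column names and types copied from the flights.dtypes output above.
FLIGHT_COLUMNS = [
    ("mon", "INT"), ("dom", "INT"), ("dow", "INT"),
    ("carrier", "STRING"), ("flight", "INT"), ("org", "STRING"),
    ("mile", "INT"), ("depart", "DOUBLE"), ("duration", "INT"),
    ("delay", "INT"),
]


def ddl_schema(columns):
    """Join (name, type) pairs into a DDL string such as "mon INT, dom INT"."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def load_flights(spark, path="flights.csv"):
    """Read the flights file with an explicit schema instead of inferSchema."""
    return spark.read.csv(
        path,
        sep=",",
        header=True,
        schema=ddl_schema(FLIGHT_COLUMNS),
        nullValue="NA",
    )
```

Keeping the columns in a plain list makes the schema easy to review next to the dtypes output, and the file is then read only once.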


### Loading SMS spam data


In [13]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType()),
])

sms = spark.read.csv("sms.csv", sep=";", header=False, schema=schema)
sms.printSchema()

root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = true)

