
# Machine Learning Foundations: A Case Study Approach

- Turi Create is a scalable machine learning library for Python. It includes SFrame, a tabular data structure for data manipulation that is backed by disk storage, so you are not limited to datasets that fit in memory and can work with large datasets even on a laptop.
> https://github.com/apple/turicreate

- Download and install Turi Create: https://github.com/apple/turicreate#installation. Note: using virtualenv is not required, but it can help, especially if you run into installation issues caused by conflicting software versions.

- The User Guide
> https://apple.github.io/turicreate/docs/userguide/

- More Detailed API Docs
> https://apple.github.io/turicreate/docs/api/

In [1]:
# Download the simple people dataset
!wget https://d396qusza40orc.cloudfront.net/phoenixassets/course1-for-students/people-example.csv


--2021-03-14 23:11:52--  https://d396qusza40orc.cloudfront.net/phoenixassets/course1-for-students/people-example.csv
Resolving d396qusza40orc.cloudfront.net (d396qusza40orc.cloudfront.net)... 13.32.80.199, 13.32.80.97, 13.32.80.200, ...
Connecting to d396qusza40orc.cloudfront.net (d396qusza40orc.cloudfront.net)|13.32.80.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 248 [text/csv]
Saving to: ‘people-example.csv’


2021-03-14 23:11:52 (34.7 MB/s) - ‘people-example.csv’ saved [248/248]



In [2]:
# Install Turi Create
!pip install turicreate

Collecting turicreate
  Downloading https://files.pythonhosted.org/packages/25/9f/a76acc465d873d217f05eac4846bd73d640b9db6d6f4a3c29ad92650fbbe/turicreate-6.4.1-cp37-cp37m-manylinux1_x86_64.whl (92.0MB)
Collecting resampy==0.2.1
  Downloading https://files.pythonhosted.org/packages/14/b6/66a06d85474190b50aee1a6c09cdc95bb405ac47338b27e9b21409da1760/resampy-0.2.1.tar.gz (322kB)
Collecting tensorflow<2.1.0,>=2.0.0
  Downloading https://files.pythonhosted.org/packages/3c/b3/3eeae9bc44039ceadceac0c7ba1cc8b1482b172810b3d7624a1cad251437/tensorflow-2.0.4-cp37-cp37m-manylinux2010_x86_64.whl (86.4MB)
Collecting numba<0.51.0
  Downloading https://files.pythonhosted.org/packages/04/be/8c88cee3366de2a3a23a9ff1a8be34e79ad1eb1ceb0d0e33aca83655ac3c/numba-0.50.1-cp37-cp37m-manylinux2014_x86_64.whl (3.6

In [3]:
import turicreate

In [4]:
sf = turicreate.SFrame("people-example.csv")

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
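The message above says the column types were guessed from the first lines of the file. As a rough illustration of that idea, here is a minimal sketch using only the Python standard library. It is not Turi Create's parser: the helper `infer_column_types` and the inline `SAMPLE_CSV` are hypothetical names made up for this example, and the sketch only distinguishes `int` from `str`.

```python
import csv
import io

# Hypothetical inline sample with the same header as people-example.csv.
SAMPLE_CSV = """First Name,Last Name,Country,age
Bob,Smith,United States,24
Alice,Williams,Canada,23
Malcolm,Jone,England,22
"""


def infer_column_types(text, sample_rows=100):
    """Guess int or str for each column from the first `sample_rows` rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Start by assuming every column is int; downgrade to str on failure.
    types = [int] * len(header)
    for row_number, row in enumerate(reader):
        if row_number >= sample_rows:
            break
        for i, value in enumerate(row[:len(header)]):
            if types[i] is int:
                try:
                    int(value)
                except ValueError:
                    types[i] = str
    return header, types


header, column_types = infer_column_types(SAMPLE_CSV)
print([t.__name__ for t in column_types])
```

If a guess like this turns out wrong for rows beyond the sampled ones, the message suggests passing the corrected list through the `column_type_hints` argument of `read_csv`.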


In [5]:
sf

First Name,Last Name,Country,age
Bob,Smith,United States,24
Alice,Williams,Canada,23
Malcolm,Jone,England,22
Felix,Brown,USA,23
Alex,Cooper,Poland,23
Tod,Campbell,United States,22
Derek,Ward,Switzerland,25


In [6]:
sf.head()

First Name,Last Name,Country,age
Bob,Smith,United States,24
Alice,Williams,Canada,23
Malcolm,Jone,England,22
Felix,Brown,USA,23
Alex,Cooper,Poland,23
Tod,Campbell,United States,22
Derek,Ward,Switzerland,25


In [7]:
sf.tail()

First Name,Last Name,Country,age
Bob,Smith,United States,24
Alice,Williams,Canada,23
Malcolm,Jone,England,22
Felix,Brown,USA,23
Alex,Cooper,Poland,23
Tod,Campbell,United States,22
Derek,Ward,Switzerland,25


In [8]:
# Visualize the SFrame, with a summary of each column
sf.show()

In [11]:
# Create a distribution graph of the age column
sf["age"].show()

In [14]:
# Show the values in the Country column
sf["Country"]

dtype: str
Rows: 7
['United States', 'Canada', 'England', 'USA', 'Poland', 'United States', 'Switzerland']

In [15]:
sf["age"]

dtype: int
Rows: 7
[24, 23, 22, 23, 23, 22, 25]

In [16]:
# Simple column operations
sf["age"].mean()

23.14285714285714

In [17]:
sf["age"].max()

25
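As a sanity check on the two results above, the same aggregates can be reproduced with the Python standard library. This is only a sketch: the `ages` list is copied by hand from the `sf["age"]` output, and no Turi Create calls are involved.

```python
from statistics import mean

# Values copied from the sf["age"] output above.
ages = [24, 23, 22, 23, 23, 22, 25]

age_mean = mean(ages)  # sum of the ages (162) divided by the row count (7)
age_max = max(ages)

print(age_mean, age_max)
```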

In [18]:
# Create a new column
sf["Full Name"] = sf["First Name"] + " " + sf["Last Name"]
sf

First Name,Last Name,Country,age,Full Name
Bob,Smith,United States,24,Bob Smith
Alice,Williams,Canada,23,Alice Williams
Malcolm,Jone,England,22,Malcolm Jone
Felix,Brown,USA,23,Felix Brown
Alex,Cooper,Poland,23,Alex Cooper
Tod,Campbell,United States,22,Tod Campbell
Derek,Ward,Switzerland,25,Derek Ward


In [19]:
# Element-wise multiplication of a column with itself
sf["age"] * sf["age"]

dtype: int
Rows: 7
[576, 529, 484, 529, 529, 484, 625]
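Both column operations above (string concatenation and multiplication) work element by element: row i of the result depends only on row i of the inputs. A plain-Python sketch of the same idea follows, with the lists copied by hand from the outputs above. It uses ordinary lists rather than SFrame columns, so it illustrates the semantics and is not a description of how Turi Create implements them.

```python
# Values copied from the outputs above.
first_names = ["Bob", "Alice", "Malcolm", "Felix", "Alex", "Tod", "Derek"]
last_names = ["Smith", "Williams", "Jone", "Brown", "Cooper", "Campbell", "Ward"]
ages = [24, 23, 22, 23, 23, 22, 25]

# Pair up the rows and combine them one at a time.
full_names = [first + " " + last for first, last in zip(first_names, last_names)]
ages_squared = [age * age for age in ages]

print(full_names)
print(ages_squared)
```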

In [20]:
# Use the apply function for a more advanced transformation of the data
def transform_country(country):
    # Normalize the "USA" abbreviation; leave every other value unchanged
    if country == "USA":
        return "United States"
    else:
        return country

print(transform_country("Brazil"))
print(transform_country("USA"))

Brazil
United States


In [21]:
sf["Country"] = sf["Country"].apply(transform_country)
sf

First Name,Last Name,Country,age,Full Name
Bob,Smith,United States,24,Bob Smith
Alice,Williams,Canada,23,Alice Williams
Malcolm,Jone,England,22,Malcolm Jone
Felix,Brown,United States,23,Felix Brown
Alex,Cooper,Poland,23,Alex Cooper
Tod,Campbell,United States,22,Tod Campbell
Derek,Ward,Switzerland,25,Derek Ward
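Conceptually, `apply` calls the function once per value in the column and collects the results into a new column. The sketch below mimics that with a list comprehension over values copied from the `sf["Country"]` output shown earlier. The dictionary `ALIASES` is a hypothetical alternative introduced here for illustration, one that may scale better when there are many spellings to normalize.

```python
def transform_country(country):
    # Same logic as the notebook's function, written as one expression.
    return "United States" if country == "USA" else country


# Values copied from the original sf["Country"] output.
countries = ["United States", "Canada", "England", "USA",
             "Poland", "United States", "Switzerland"]

# Plain-Python stand-in for sf["Country"].apply(transform_country).
cleaned = [transform_country(c) for c in countries]

# Hypothetical alternative: a lookup table of aliases, falling back to
# the original value when no alias is registered.
ALIASES = {"USA": "United States"}
cleaned_via_dict = [ALIASES.get(c, c) for c in countries]

print(cleaned)
```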
