### Tour of the Python Ecosystem

As we've seen, Anaconda comes with a large variety of useful packages for data science and, as we'll see soon, we can install more using conda or pip.

In general, if you want to do something with Python, you can google 'name of thing' and 'Python':

In [None]:
import antigravity

However, this blessing can be a curse in disguise. How do we know which packages are trustworthy? How do we know which one to use for a particular task?

Today we will go on a whirlwind tour of useful packages. Feel free to chime in with any you like in the class Slack.

### Getting Data

Getting data into Python is often as simple as calling `pd.read_csv` from pandas, but sometimes our data is in a database or another format.

There are a ton of packages for interacting with databases. One of the more common is SQLAlchemy. SQLAlchemy is an object-relational mapper (ORM): it maps database tables to Python objects, so we can work with data models and reduce our hand-written queries to the bare minimum. For our purposes, it's a bit much, so we will go one level lower, to pyodbc.

Pyodbc stands for Python Open Database Connectivity. ODBC is an effort by database programmers to make a universal interface to databases, regardless of OS, database type, or programming language. It is a standard application programming interface (API).

More or less, this means that the details of connecting to a PostgreSQL database running on Windows, or a MongoDB instance on Solaris, can be forgotten about, and we can interact with them through the ODBC connector.

Let's try and get some database data into Python. There is a bit of hacking required to get pyodbc to connect to MAMP, so for now we will use a SQLite database file and sqlite3.

Download the file chinook.db from [here](https://drive.google.com/open?id=1LlFlmdisDo6VuTZXHoZfTrTzuBuOnkQC).

We make a connection object, and then pass it to pandas `read_sql`. Pandas sends the query to the database, and gets the results as a dataframe:

In [1]:
import sqlite3
#import pyodbc
import pandas as pd

conn = sqlite3.connect('data/chinook.db')
#conn = pyodbc.connect('driver={};db={};host={};')
pd.read_sql('select * from playlists;', conn)

Unnamed: 0,PlaylistId,Name
0,1,Music
1,2,Movies
2,3,TV Shows
3,4,Audiobooks
4,5,90’s Music
5,6,Audiobooks
6,7,Movies
7,8,Music
8,9,Music Videos
9,10,TV Shows


In [11]:
# sqlite_master describes every table in the database, including the SQL that created it:
pd.read_sql('SELECT * FROM sqlite_master;', conn)

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,albums,albums,2,"CREATE TABLE ""albums""\r\n(\r\n [AlbumId] IN..."
1,table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
2,table,artists,artists,4,"CREATE TABLE ""artists""\r\n(\r\n [ArtistId] ..."
3,table,customers,customers,5,"CREATE TABLE ""customers""\r\n(\r\n [Customer..."
4,table,employees,employees,8,"CREATE TABLE ""employees""\r\n(\r\n [Employee..."
5,table,genres,genres,10,"CREATE TABLE ""genres""\r\n(\r\n [GenreId] IN..."
6,table,invoices,invoices,11,"CREATE TABLE ""invoices""\r\n(\r\n [InvoiceId..."
7,table,invoice_items,invoice_items,13,"CREATE TABLE ""invoice_items""\r\n(\r\n [Invo..."
8,table,media_types,media_types,15,"CREATE TABLE ""media_types""\r\n(\r\n [MediaT..."
9,table,playlists,playlists,16,"CREATE TABLE ""playlists""\r\n(\r\n [Playlist..."


### Exercise

Get the following data into pandas data frames:

1. Get all customers (name, country and id) from the United States
2. Get all customers not from Brazil
3. Get all invoices from customers from the United States
4. Show the number of invoices per country

In [41]:
#1
#pd.read_sql('select * from customers',conn)

customers_USA = pd.read_sql("select CustomerId, FirstName, LastName, Country from customers where Country = 'USA'", conn)
display(customers_USA.head())

#2
customers_not_Brazil = pd.read_sql("select CustomerId, FirstName, LastName, Country from customers where Country <> 'Brazil'", conn)
display(customers_not_Brazil.head())

Unnamed: 0,CustomerId,FirstName,LastName,Country
0,16,Frank,Harris,USA
1,17,Jack,Smith,USA
2,18,Michelle,Brooks,USA
3,19,Tim,Goyer,USA
4,20,Dan,Miller,USA


Unnamed: 0,CustomerId,FirstName,LastName,Country
0,2,Leonie,Köhler,Germany
1,3,François,Tremblay,Canada
2,4,Bjørn,Hansen,Norway
3,5,František,Wichterlová,Czech Republic
4,6,Helena,Holý,Czech Republic


In [35]:
#3 - Get all invoices from USA customers
invoices = pd.read_sql('Select * from invoices',conn)
# Match on CustomerId: comparing InvoiceId with the customer ids would select unrelated invoices
display(invoices[invoices['CustomerId'].isin(customers_USA['CustomerId'])])


In [46]:
#4 Show number of invoices per country
invoices.groupby('BillingCountry').size()

BillingCountry
Argentina          7
Australia          7
Austria            7
Belgium            7
Brazil            35
Canada            56
Chile              7
Czech Republic    14
Denmark            7
Finland            7
France            35
Germany           28
Hungary            7
India             13
Ireland            7
Italy              7
Netherlands        7
Norway             7
Poland             7
Portugal          14
Spain              7
Sweden             7
USA               91
United Kingdom    21
dtype: int64
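
Exercises 3 and 4 can also be pushed down into the database, so that only the rows we need ever reach Python. The sketch below is one possible way to do this, not the only one. It builds a tiny in-memory database whose table and column names mirror the Chinook tables used above, but the three customers and four invoices in it are made up for illustration.

```python
import sqlite3

# Hypothetical miniature of the Chinook schema, held in memory for illustration only.
demo = sqlite3.connect(':memory:')
demo.executescript("""
    CREATE TABLE customers (CustomerId INTEGER PRIMARY KEY, Country TEXT);
    CREATE TABLE invoices  (InvoiceId INTEGER PRIMARY KEY, CustomerId INTEGER,
                            BillingCountry TEXT, Total REAL);
    INSERT INTO customers VALUES (1, 'USA'), (2, 'Brazil'), (3, 'USA');
    INSERT INTO invoices  VALUES (10, 1, 'USA', 3.96), (11, 2, 'Brazil', 5.94),
                                 (12, 3, 'USA', 0.99), (13, 1, 'USA', 1.98);
""")

# Exercise 3: join invoices to customers and filter on the customer's country.
# The "?" placeholder lets sqlite3 handle the quoting of the value.
usa_invoices = demo.execute(
    """
    SELECT i.InvoiceId, i.CustomerId, i.Total
    FROM invoices AS i
    JOIN customers AS c ON c.CustomerId = i.CustomerId
    WHERE c.Country = ?
    ORDER BY i.InvoiceId
    """,
    ('USA',),
).fetchall()

# Exercise 4: let the database count the invoices per country.
invoices_per_country = demo.execute(
    """
    SELECT BillingCountry, COUNT(*) AS n_invoices
    FROM invoices
    GROUP BY BillingCountry
    ORDER BY BillingCountry
    """
).fetchall()

print(usa_invoices)
print(invoices_per_country)
demo.close()
```

The same query strings should work against the real `conn`, for example through `pd.read_sql(query, conn, params=('USA',))`, which returns the result as a dataframe instead of a list of tuples.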

### Requests

The [requests](http://docs.python-requests.org/en/master/) module allows us to write a custom wrapper around a web API. Many websites provide a way of interacting with them programmatically through an API. The way it generally works is that you send a request to a certain URL, and the data is returned as a JSON response.

For example, if we send a web request to the Star Wars API:

https://swapi.co/api/people/1/

we get a nicely formatted JSON document with the data about that person.

Lots of websites have their own Python API packages, so it may be easier to install these: [Github](https://github.com/PyGithub/PyGithub), [Google](https://developers.google.com/api-client-library/python/), [Strava](https://github.com/hozn/stravalib), [Twitter](https://developer.twitter.com/en/docs/developer-utilities/twitter-libraries.html), etc.

However, let's put together a quick request to swapi using requests:

In [71]:
import requests

#shows person 1 in people site
r = requests.get('https://swapi.co/api/people/1/')
r
#if response starts with a 2, response went through

<Response [200]>

The response object is a way of holding everything we got back. If it starts with `2`, that is a good sign that the request went well (you probably know what a 404 would mean).

We can use all the HTTP verbs (such as GET, POST, PUT and DELETE) to download, upload or delete data, if the API supports it.

Now we can see what is inside it, using the docs or tab completion.

In [72]:
r.json()

{'name': 'Luke Skywalker',
 'height': '172',
 'mass': '77',
 'hair_color': 'blond',
 'skin_color': 'fair',
 'eye_color': 'blue',
 'birth_year': '19BBY',
 'gender': 'male',
 'homeworld': 'https://swapi.co/api/planets/1/',
 'films': ['https://swapi.co/api/films/2/',
  'https://swapi.co/api/films/6/',
  'https://swapi.co/api/films/3/',
  'https://swapi.co/api/films/1/',
  'https://swapi.co/api/films/7/'],
 'species': ['https://swapi.co/api/species/1/'],
 'vehicles': ['https://swapi.co/api/vehicles/14/',
  'https://swapi.co/api/vehicles/30/'],
 'starships': ['https://swapi.co/api/starships/12/',
  'https://swapi.co/api/starships/22/'],
 'created': '2014-12-09T13:50:51.644000Z',
 'edited': '2014-12-20T21:17:56.891000Z',
 'url': 'https://swapi.co/api/people/1/'}

It is now up to us to parse the JSON. You can see there are links to other objects in there, which we might want to request too.
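
The linked objects are plain URLs, so following one is just another `requests.get(url)`. Sometimes we only need the id at the end of the link, for example to join against another table. A minimal sketch of pulling those ids out, using a few fields copied from the response above (`resource_id` is a hypothetical helper name, not part of any library):

```python
# A few fields copied from the response shown above.
person = {
    'name': 'Luke Skywalker',
    'homeworld': 'https://swapi.co/api/planets/1/',
    'starships': ['https://swapi.co/api/starships/12/',
                  'https://swapi.co/api/starships/22/'],
}

def resource_id(url):
    """Return the trailing integer id of a URL such as .../planets/1/."""
    return int(url.rstrip('/').split('/')[-1])

homeworld_id = resource_id(person['homeworld'])
starship_ids = [resource_id(url) for url in person['starships']]
print(homeworld_id, starship_ids)
```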

For now, let's just use the easy pandas parser:

In [73]:
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

json_normalize(r.json())

Unnamed: 0,birth_year,created,edited,eye_color,films,gender,hair_color,height,homeworld,mass,name,skin_color,species,starships,url,vehicles
0,19BBY,2014-12-09T13:50:51.644000Z,2014-12-20T21:17:56.891000Z,blue,"[https://swapi.co/api/films/2/, https://swapi....",male,blond,172,https://swapi.co/api/planets/1/,77,Luke Skywalker,fair,[https://swapi.co/api/species/1/],"[https://swapi.co/api/starships/12/, https://s...",https://swapi.co/api/people/1/,"[https://swapi.co/api/vehicles/14/, https://sw..."


### Exercise

1. Request and normalize the data for people 1-10. Put it into a single dataframe.
2. Try to get the first 10 films too.
3. Explore the documentation for the requests library. Can you see how you might delete an object? What happens if you try it here?

In [78]:
#1 -- METHOD 1
import numpy as np
import pandas as pd

df = pd.DataFrame()
# display(df)
for i in np.arange(1,11):
    r = requests.get(f'https://swapi.co/api/people/{i}/')
    print(f"Person {i}")
    if i == 1:
        df = json_normalize(r.json())
    else:
        df = pd.concat([df, json_normalize(r.json())])

df.reset_index(drop = True, inplace = True)
display(df)

Person 1
Person 2
Person 3
Person 4
Person 5
Person 6
Person 7
Person 8
Person 9
Person 10


Unnamed: 0,birth_year,created,edited,eye_color,films,gender,hair_color,height,homeworld,mass,name,skin_color,species,starships,url,vehicles
0,19BBY,2014-12-09T13:50:51.644000Z,2014-12-20T21:17:56.891000Z,blue,"[https://swapi.co/api/films/2/, https://swapi....",male,blond,172,https://swapi.co/api/planets/1/,77,Luke Skywalker,fair,[https://swapi.co/api/species/1/],"[https://swapi.co/api/starships/12/, https://s...",https://swapi.co/api/people/1/,"[https://swapi.co/api/vehicles/14/, https://sw..."
1,112BBY,2014-12-10T15:10:51.357000Z,2014-12-20T21:17:50.309000Z,yellow,"[https://swapi.co/api/films/2/, https://swapi....",,,167,https://swapi.co/api/planets/1/,75,C-3PO,gold,[https://swapi.co/api/species/2/],[],https://swapi.co/api/people/2/,[]
2,33BBY,2014-12-10T15:11:50.376000Z,2014-12-20T21:17:50.311000Z,red,"[https://swapi.co/api/films/2/, https://swapi....",,,96,https://swapi.co/api/planets/8/,32,R2-D2,"white, blue",[https://swapi.co/api/species/2/],[],https://swapi.co/api/people/3/,[]
3,41.9BBY,2014-12-10T15:18:20.704000Z,2014-12-20T21:17:50.313000Z,yellow,"[https://swapi.co/api/films/2/, https://swapi....",male,none,202,https://swapi.co/api/planets/1/,136,Darth Vader,white,[https://swapi.co/api/species/1/],[https://swapi.co/api/starships/13/],https://swapi.co/api/people/4/,[]
4,19BBY,2014-12-10T15:20:09.791000Z,2014-12-20T21:17:50.315000Z,brown,"[https://swapi.co/api/films/2/, https://swapi....",female,brown,150,https://swapi.co/api/planets/2/,49,Leia Organa,light,[https://swapi.co/api/species/1/],[],https://swapi.co/api/people/5/,[https://swapi.co/api/vehicles/30/]
5,52BBY,2014-12-10T15:52:14.024000Z,2014-12-20T21:17:50.317000Z,blue,"[https://swapi.co/api/films/5/, https://swapi....",male,"brown, grey",178,https://swapi.co/api/planets/1/,120,Owen Lars,light,[https://swapi.co/api/species/1/],[],https://swapi.co/api/people/6/,[]
6,47BBY,2014-12-10T15:53:41.121000Z,2014-12-20T21:17:50.319000Z,blue,"[https://swapi.co/api/films/5/, https://swapi....",female,brown,165,https://swapi.co/api/planets/1/,75,Beru Whitesun lars,light,[https://swapi.co/api/species/1/],[],https://swapi.co/api/people/7/,[]
7,unknown,2014-12-10T15:57:50.959000Z,2014-12-20T21:17:50.321000Z,red,[https://swapi.co/api/films/1/],,,97,https://swapi.co/api/planets/1/,32,R5-D4,"white, red",[https://swapi.co/api/species/2/],[],https://swapi.co/api/people/8/,[]
8,24BBY,2014-12-10T15:59:50.509000Z,2014-12-20T21:17:50.323000Z,brown,[https://swapi.co/api/films/1/],male,black,183,https://swapi.co/api/planets/1/,84,Biggs Darklighter,light,[https://swapi.co/api/species/1/],[https://swapi.co/api/starships/12/],https://swapi.co/api/people/9/,[]
9,57BBY,2014-12-10T16:16:29.192000Z,2014-12-20T21:17:50.325000Z,blue-gray,"[https://swapi.co/api/films/2/, https://swapi....",male,"auburn, white",182,https://swapi.co/api/planets/20/,77,Obi-Wan Kenobi,fair,[https://swapi.co/api/species/1/],"[https://swapi.co/api/starships/48/, https://s...",https://swapi.co/api/people/10/,[https://swapi.co/api/vehicles/38/]


In [80]:
#1 - METHOD 2 - Faster than above
requests_data = []
for i in range(1,11):
    r = requests.get(f'https://swapi.co/api/people/{i}/')
    print(f"Person {i}")
    requests_data.append(json_normalize(r.json()))

df = pd.concat(requests_data)

df.reset_index(drop = True, inplace = True)
df

Person 1
Person 2
Person 3
Person 4
Person 5
Person 6
Person 7
Person 8
Person 9
Person 10


Unnamed: 0,birth_year,created,edited,eye_color,films,gender,hair_color,height,homeworld,mass,name,skin_color,species,starships,url,vehicles
0,19BBY,2014-12-09T13:50:51.644000Z,2014-12-20T21:17:56.891000Z,blue,"[https://swapi.co/api/films/2/, https://swapi....",male,blond,172,https://swapi.co/api/planets/1/,77,Luke Skywalker,fair,[https://swapi.co/api/species/1/],"[https://swapi.co/api/starships/12/, https://s...",https://swapi.co/api/people/1/,"[https://swapi.co/api/vehicles/14/, https://sw..."
1,112BBY,2014-12-10T15:10:51.357000Z,2014-12-20T21:17:50.309000Z,yellow,"[https://swapi.co/api/films/2/, https://swapi....",,,167,https://swapi.co/api/planets/1/,75,C-3PO,gold,[https://swapi.co/api/species/2/],[],https://swapi.co/api/people/2/,[]
2,33BBY,2014-12-10T15:11:50.376000Z,2014-12-20T21:17:50.311000Z,red,"[https://swapi.co/api/films/2/, https://swapi....",,,96,https://swapi.co/api/planets/8/,32,R2-D2,"white, blue",[https://swapi.co/api/species/2/],[],https://swapi.co/api/people/3/,[]
3,41.9BBY,2014-12-10T15:18:20.704000Z,2014-12-20T21:17:50.313000Z,yellow,"[https://swapi.co/api/films/2/, https://swapi....",male,none,202,https://swapi.co/api/planets/1/,136,Darth Vader,white,[https://swapi.co/api/species/1/],[https://swapi.co/api/starships/13/],https://swapi.co/api/people/4/,[]
4,19BBY,2014-12-10T15:20:09.791000Z,2014-12-20T21:17:50.315000Z,brown,"[https://swapi.co/api/films/2/, https://swapi....",female,brown,150,https://swapi.co/api/planets/2/,49,Leia Organa,light,[https://swapi.co/api/species/1/],[],https://swapi.co/api/people/5/,[https://swapi.co/api/vehicles/30/]
5,52BBY,2014-12-10T15:52:14.024000Z,2014-12-20T21:17:50.317000Z,blue,"[https://swapi.co/api/films/5/, https://swapi....",male,"brown, grey",178,https://swapi.co/api/planets/1/,120,Owen Lars,light,[https://swapi.co/api/species/1/],[],https://swapi.co/api/people/6/,[]
6,47BBY,2014-12-10T15:53:41.121000Z,2014-12-20T21:17:50.319000Z,blue,"[https://swapi.co/api/films/5/, https://swapi....",female,brown,165,https://swapi.co/api/planets/1/,75,Beru Whitesun lars,light,[https://swapi.co/api/species/1/],[],https://swapi.co/api/people/7/,[]
7,unknown,2014-12-10T15:57:50.959000Z,2014-12-20T21:17:50.321000Z,red,[https://swapi.co/api/films/1/],,,97,https://swapi.co/api/planets/1/,32,R5-D4,"white, red",[https://swapi.co/api/species/2/],[],https://swapi.co/api/people/8/,[]
8,24BBY,2014-12-10T15:59:50.509000Z,2014-12-20T21:17:50.323000Z,brown,[https://swapi.co/api/films/1/],male,black,183,https://swapi.co/api/planets/1/,84,Biggs Darklighter,light,[https://swapi.co/api/species/1/],[https://swapi.co/api/starships/12/],https://swapi.co/api/people/9/,[]
9,57BBY,2014-12-10T16:16:29.192000Z,2014-12-20T21:17:50.325000Z,blue-gray,"[https://swapi.co/api/films/2/, https://swapi....",male,"auburn, white",182,https://swapi.co/api/planets/20/,77,Obi-Wan Kenobi,fair,[https://swapi.co/api/species/1/],"[https://swapi.co/api/starships/48/, https://s...",https://swapi.co/api/people/10/,[https://swapi.co/api/vehicles/38/]
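
Method 2 is faster because `pd.concat` is called once, instead of copying an ever-growing dataframe on every pass through the loop. We can probably go one step further: `json_normalize` also accepts a list of dictionaries, so we could collect the raw `r.json()` results and normalize them in a single call. The sketch below assumes pandas 1.0 or later (where the function is available as `pd.json_normalize`) and uses three hand-copied, shortened records in place of real responses.

```python
import pandas as pd

# Stand-ins for the r.json() dictionaries collected in the loop;
# only a few of the fields shown above are kept.
records = [
    {'name': 'Luke Skywalker', 'height': '172', 'mass': '77'},
    {'name': 'C-3PO', 'height': '167', 'mass': '75'},
    {'name': 'R2-D2', 'height': '96', 'mass': '32'},
]

# One call builds the whole frame, already indexed 0..n-1,
# so no concat or reset_index is needed afterwards.
people_df = pd.json_normalize(records)
print(people_df)
```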


In [85]:
#2 - FIRST 10 FILMS
r_data = []
for i in range(1,11):
    r = requests.get(f"https://swapi.co/api/films/{i}/")
    r_data.append(json_normalize(r.json()))

films_df = pd.concat(r_data).reset_index(drop=True)
films_df


Unnamed: 0,characters,created,detail,director,edited,episode_id,opening_crawl,planets,producer,release_date,species,starships,title,url,vehicles
0,"[https://swapi.co/api/people/1/, https://swapi...",2014-12-10T14:23:31.880000Z,,George Lucas,2015-04-11T09:46:52.774897Z,4.0,It is a period of civil war.\r\nRebel spaceshi...,"[https://swapi.co/api/planets/2/, https://swap...","Gary Kurtz, Rick McCallum",1977-05-25,"[https://swapi.co/api/species/5/, https://swap...","[https://swapi.co/api/starships/2/, https://sw...",A New Hope,https://swapi.co/api/films/1/,"[https://swapi.co/api/vehicles/4/, https://swa..."
1,"[https://swapi.co/api/people/1/, https://swapi...",2014-12-12T11:26:24.656000Z,,Irvin Kershner,2017-04-19T10:57:29.544256Z,5.0,It is a dark time for the\r\nRebellion. Althou...,"[https://swapi.co/api/planets/4/, https://swap...","Gary Kurtz, Rick McCallum",1980-05-17,"[https://swapi.co/api/species/6/, https://swap...","[https://swapi.co/api/starships/15/, https://s...",The Empire Strikes Back,https://swapi.co/api/films/2/,"[https://swapi.co/api/vehicles/8/, https://swa..."
2,"[https://swapi.co/api/people/1/, https://swapi...",2014-12-18T10:39:33.255000Z,,Richard Marquand,2015-04-11T09:46:05.220365Z,6.0,Luke Skywalker has returned to\r\nhis home pla...,"[https://swapi.co/api/planets/5/, https://swap...","Howard G. Kazanjian, George Lucas, Rick McCallum",1983-05-25,"[https://swapi.co/api/species/1/, https://swap...","[https://swapi.co/api/starships/15/, https://s...",Return of the Jedi,https://swapi.co/api/films/3/,"[https://swapi.co/api/vehicles/8/, https://swa..."
3,"[https://swapi.co/api/people/2/, https://swapi...",2014-12-19T16:52:55.740000Z,,George Lucas,2015-04-11T09:45:18.689301Z,1.0,Turmoil has engulfed the\r\nGalactic Republic....,"[https://swapi.co/api/planets/8/, https://swap...",Rick McCallum,1999-05-19,"[https://swapi.co/api/species/1/, https://swap...","[https://swapi.co/api/starships/40/, https://s...",The Phantom Menace,https://swapi.co/api/films/4/,"[https://swapi.co/api/vehicles/33/, https://sw..."
4,"[https://swapi.co/api/people/2/, https://swapi...",2014-12-20T10:57:57.886000Z,,George Lucas,2015-04-11T09:45:01.623982Z,2.0,There is unrest in the Galactic\r\nSenate. Sev...,"[https://swapi.co/api/planets/8/, https://swap...",Rick McCallum,2002-05-16,"[https://swapi.co/api/species/32/, https://swa...","[https://swapi.co/api/starships/21/, https://s...",Attack of the Clones,https://swapi.co/api/films/5/,"[https://swapi.co/api/vehicles/4/, https://swa..."
5,"[https://swapi.co/api/people/1/, https://swapi...",2014-12-20T18:49:38.403000Z,,George Lucas,2015-04-11T09:45:44.862122Z,3.0,War! The Republic is crumbling\r\nunder attack...,"[https://swapi.co/api/planets/2/, https://swap...",Rick McCallum,2005-05-19,"[https://swapi.co/api/species/19/, https://swa...","[https://swapi.co/api/starships/48/, https://s...",Revenge of the Sith,https://swapi.co/api/films/6/,"[https://swapi.co/api/vehicles/33/, https://sw..."
6,"[https://swapi.co/api/people/1/, https://swapi...",2015-04-17T06:51:30.504780Z,,J. J. Abrams,2015-12-17T14:31:47.617768Z,7.0,Luke Skywalker has vanished.\r\nIn his absence...,[https://swapi.co/api/planets/61/],"Kathleen Kennedy, J. J. Abrams, Bryan Burk",2015-12-11,"[https://swapi.co/api/species/3/, https://swap...","[https://swapi.co/api/starships/77/, https://s...",The Force Awakens,https://swapi.co/api/films/7/,[]
7,,,Not found,,,,,,,,,,,,
8,,,Not found,,,,,,,,,,,,
9,,,Not found,,,,,,,,,,,,
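
The last three rows contain nothing but `Not found` in the `detail` column: the API only knows seven films, so requests 8 to 10 presumably came back with an error status. One way to keep such rows out of the dataframe is to check the response before storing it. The sketch below is an illustration under assumptions: `fetch_existing` and its `get` argument are hypothetical names, and `get` only exists so that the HTTP call can be swapped out.

```python
def fetch_existing(url_template, ids, get=None):
    """Return the parsed JSON for every id whose request succeeded.

    `get` is a hypothetical hook for swapping in another HTTP function
    (for example in tests); by default `requests.get` is used.
    """
    if get is None:
        import requests  # imported lazily so the hook works without it
        get = requests.get

    records = []
    for i in ids:
        response = get(url_template.format(i))
        if response.ok:  # True when the status code is below 400
            records.append(response.json())
    return records
```

Calling `fetch_existing('https://swapi.co/api/films/{}/', range(1, 11))` and passing the result to `json_normalize` should then give a dataframe with one row per film that actually exists.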


### Web Scraping

Sometimes websites don't have nice data APIs to interact with. In these cases, we can try to scrape the data straight off the webpage using a bot.

We will use requests and pandas again. Scrapy (a crawling framework) and Beautiful Soup (an HTML parser) are other powerful Python tools in the same space.

Let's do some scraping. Let's get some information about the New Zealand rugby team, the All Blacks, from Wikipedia.

In [86]:
#using requests to get the data
html_doc = requests.get('https://en.wikipedia.org/wiki/New_Zealand_national_rugby_union_team').content

We now have the html data in memory. So, how do we go from that to the data in the page?

We can parse out the tables using pandas:

In [87]:
tables = pd.read_html(html_doc)

You can probably see that most of the tables are a bit messy. Unfortunately, this is part of doing web crawling:

In [88]:
tables[5].head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,Year,Round,Played,Won,Drew,Lost,Pts For,Against
1,1987,Champions,6,6,0,0,298,52
2,1991,Third place,6,5,0,1,142,74
3,1995,Runners-up,6,5,0,1,327,119
4,1999,Fourth place,6,4,0,2,255,111


If we need to parse the same kind of page or table multiple times, we can use Beautiful Soup or Scrapy to automate the cleaning and scraping process. A common use for this is scraping stock prices, where the same tables and values appear on the page of every stock.
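
For a small table like the World Cup one above, a little pandas is often enough for the cleaning step. The sketch below rebuilds the first rows of `tables[5]` by hand, assuming the scraped values arrive as strings, and then promotes the first row to column names and converts the numeric columns. `promote_header` is a hypothetical helper name.

```python
import pandas as pd

# The first rows of tables[5] as printed above: the header ended up in row 0.
raw = pd.DataFrame([
    ['Year', 'Round', 'Played', 'Won', 'Drew', 'Lost', 'Pts For', 'Against'],
    ['1987', 'Champions', '6', '6', '0', '0', '298', '52'],
    ['1991', 'Third place', '6', '5', '0', '1', '142', '74'],
    ['1995', 'Runners-up', '6', '5', '0', '1', '327', '119'],
])

def promote_header(table):
    """Use the first row as column names and convert numeric-looking columns."""
    clean = table.iloc[1:].reset_index(drop=True)
    clean.columns = list(table.iloc[0])
    for col in clean.columns:
        try:
            clean[col] = pd.to_numeric(clean[col])
        except (ValueError, TypeError):
            pass  # leave text columns such as 'Round' unchanged
    return clean

world_cup = promote_header(raw)
print(world_cup.dtypes)
```

`pd.read_html` also takes a `header` argument, so passing `header=0` may do the first half of this job at parse time, depending on how the page marks up its tables.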

### Exercise

Scrape the table from the Yahoo Finance page for Amazon's stock price: https://finance.yahoo.com/quote/AMZN/. Can you automate it to scrape Facebook, Google/Alphabet, Tesla, Micron Technology and NVIDIA?

### Checking Performance

So far we have only talked about using `%%timeit` to measure runtime, but how about other system resources?

We can use `psutil` to interact with the operating system and tell us how much memory we are using. This is very useful for diagnosing memory leaks, or for finding out why your system is slowing down:

In [90]:
import os  # needed to look up the id of the current process
import psutil

def memory_usage_psutil():
    process = psutil.Process(os.getpid()) #this finds the id of our python process
    mem = process.memory_percent()  #returns percentage
    return mem

In [None]:
print(memory_usage_psutil())
import numpy as np

x = np.ones(10000000)

print(memory_usage_psutil())

We can also [profile our code](https://docs.python.org/3.6/library/profile.html), to see where our script is spending its time. This is a bit beyond the scope for today but it is very helpful to track down where a slow piece of code is getting stuck.
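
As a small taste of what profiling looks like, here is a minimal sketch using `cProfile` and `pstats` from the standard library. The functions `slow_sum` and `workload` are made up purely to have something to measure.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately naive loop, used only as something to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def workload():
    return [slow_sum(20_000) for _ in range(20)]

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time and print the five most expensive entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('cumulative').print_stats(5)
print(report.getvalue())
```

The report lists, for each function, how often it was called and how much time was spent in it, which is usually enough to spot the slow part. Inside a notebook, the `%prun` magic gives a similar report for a single statement.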

### Scipy Submodules

We have only scratched the surface of scipy. There are some super nice submodules inside it, for things like signal processing and other scientific applications.

A couple of useful submodules are the sparse library, and the optimize library.

#### Scipy.sparse

When dealing with large data, it is common to have a lot of sparsity. When we dummify, or one hot encode data, a large percentage of our model matrix can be 0s. If we have a categorical variable, with thousands of possible values (like a supermarket item), our matrix can be thousands of values wide.

In this case, rather than hold a lot of zeroes, we can take advantage of the numerous methods that have been developed for holding sparse matrices. On top of the memory savings, scipy.sparse implements a large number of linear algebra functions in ways that take advantage of sparsity, and these can run much faster than their dense alternatives. Check out the [library docs here](https://docs.scipy.org/doc/scipy/reference/sparse.html).

#### Sparse Formats

We can carry our data in a variety of different formats, depending on the application. In general, we will use coordinate (COO) format for construction, and shift into CSR or CSC format for faster math. We might also want to use DOK format for fast lookups. The exact choice depends on your application: try them out, and dive into the library docs if you need more.

In [None]:
from scipy import sparse
import numpy as np

#COO format is based on tuples of coordinates:
row  = np.array([0, 3, 1, 0])
col  = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
sparse.coo_matrix((data, (row, col)), shape=(4, 4)).toarray()

In [None]:
#CSC and CSR format are very similar
data = np.array([1, 2, 3, 4, 5, 6])

#indices are either row indices (csc) or column indices (csr)
indices = np.array([0, 2, 2, 0, 1, 2])

#indptr is an array that points to new column starts in the 'indices' & 'data' arrays
indptr = np.array([0, 2, 3, 6])

sparse.csc_matrix((data, indices, indptr), shape=(3, 3)).toarray()
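
The pointer format is easier to follow once we decode it by hand. The sketch below uses the same three arrays as the cell above, as plain Python lists, and rebuilds the dense matrix without scipy: column `j` owns the stored entries from `indptr[j]` up to (but not including) `indptr[j+1]`.

```python
# The same three arrays as in the cell above, as plain Python lists.
data    = [1, 2, 3, 4, 5, 6]
indices = [0, 2, 2, 0, 1, 2]   # row index of each stored value (CSC)
indptr  = [0, 2, 3, 6]         # where each column starts in data/indices

n_rows, n_cols = 3, 3
dense = [[0] * n_cols for _ in range(n_rows)]

for col in range(n_cols):
    start, stop = indptr[col], indptr[col + 1]
    for value, row in zip(data[start:stop], indices[start:stop]):
        dense[row][col] = value

for row in dense:
    print(row)
```

For CSR the roles are swapped: `indptr` marks where each row starts, and `indices` holds column numbers.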

Using the constructor functions above, we can create matrices and then carry out operations between them:

In [None]:
x = np.zeros((10000,1000))
x[1:3,3:5] = 1
x = sparse.coo_matrix(x)
print(x.data)
print(x.col)
print(x.row)

y = sparse.csr_matrix(x)
print(y.data)
print(y.indptr)
print(y.indices)

print(x.T.dot(y).todense())
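
To make the memory saving concrete, we can add up the sizes of the arrays that each representation stores. This is a rough sketch: it uses a smaller 1000 x 1000 matrix than the cell above, the names `dense` and `compact` are only for illustration, and the count ignores the small overhead of the Python objects themselves.

```python
import numpy as np
from scipy import sparse

# A 1000 x 1000 matrix of zeros with only four non-zero entries.
dense = np.zeros((1000, 1000))
dense[1:3, 3:5] = 1

compact = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
# A CSR matrix stores three arrays: the values, their column indices, and the row pointers.
sparse_bytes = compact.data.nbytes + compact.indices.nbytes + compact.indptr.nbytes

print(f"dense:  {dense_bytes:>9,} bytes")
print(f"sparse: {sparse_bytes:>9,} bytes")
```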

### Exercise

1. Create sparse matrices using coo_matrix, csc_matrix and csr_matrix and the constructors above (i.e., use the coordinate and pointer versions) that look like:

```
1 0 0 0 4
2 0 0 0 5
3 0 4 0 0
4 0 0 7 0
```

2. Compare the memory usage of the different answers, using our memory usage function from psutil. Does it match your intuition?

#### Scipy.optimize

We will now take a closer look at optimization which is a key part of data science. How do we find the minimal value of the objective function in our models?

The answer for many data scientists is to use a black box solver. There are many commercial solvers available; CPLEX and Gurobi are two market leaders, both with Python APIs. So far, most of the models we have run use either scipy's built-in solvers or scikit-learn's. Let's try and get an intuition for what is going on inside these black boxes.

We can think a little bit about how a function works. In our example, we want the minimum of the sum of squared residuals. One way of finding the maxima and minima of a function is to take the derivative, and set it equal to 0:

$$ \frac{d}{dx}f(x) = 0. $$

If we know the function for our derivative, we can analytically determine the maximum and minimum. However, in all but the most trivial models, calculating the derivative of the function over the entire range, and making sure we find all 0s, is an impossible task. Optimization is the way we try to solve this. There are many methods, dating back to Newton (the `newton` in `newton-cg`, which you will see in a couple of lessons).

In gradient descent, we calculate the gradient (i.e., the derivatives) at our current position, find the direction of steepest descent, and take a step down the surface.

<img src="http://i.imgur.com/Ud0YGqX.png" width=700 height=700>

Scipy.optimize has a lot of ways of carrying out our optimization, and can get very technical. For now, let's try and minimize the function:

$$ y = x^4 - 2x -1 $$

In [91]:
from scipy import optimize 
from matplotlib import pyplot as plt
#%matplotlib inline

def f(x):
    return x**4-2*x-1

optimize.fmin_cg(f,0)

x = np.arange(-4,4,0.01)
plt.plot(x,f(x))
plt.scatter([0.79370052],[f(0.79370052)],color='red');
plt.grid()
plt.show()

Optimization terminated successfully.
         Current function value: -2.190551
         Iterations: 3
         Function evaluations: 24
         Gradient evaluations: 8


<Figure size 640x480 with 1 Axes>
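
To get a feel for what a gradient-based solver is doing, we can write the simplest possible version ourselves. The sketch below is plain fixed-step gradient descent on the same function, with the derivative worked out by hand. It is not the conjugate gradient algorithm that `fmin_cg` uses, and the step size, the number of steps and the helper names are arbitrary choices for this example.

```python
def f(x):
    return x**4 - 2*x - 1

def f_prime(x):
    # Derivative of f, worked out by hand: d/dx (x^4 - 2x - 1) = 4x^3 - 2
    return 4 * x**3 - 2

def gradient_descent(grad, x0, learning_rate=0.05, n_steps=200):
    """Plain fixed-step gradient descent (illustrative only)."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)   # step against the slope
    return x

x_min = gradient_descent(f_prime, x0=0.0)
x_exact = 0.5 ** (1 / 3)                  # solves 4x^3 - 2 = 0 analytically
print(x_min, x_exact, f(x_min))
```

Both values should agree with the red point in the plot, at roughly x = 0.7937. With a much larger step size the iteration can overshoot and diverge, which is one reason real solvers choose their steps more carefully.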

### Summary

The Python package ecosystem is really big. In general, you need to learn how to read the help and documentation, and how to learn; from there, you can find your way whenever you need to pick up a new package.

If you think you need a new package for your capstone and can't figure it out, let us know!