# Welcome to GGR274

- What is Data Science?
- What is the role of Statistics in Data Science?

## Where is Data Science Used in Society?

### Questions

- Is there a Bike Share bike available?


- Will this person commit a crime in the future?


- Does COVID-19 affect males more than females?


- What movies might this person enjoy?


- How many people attended Indian Day Schools in Canada?

## Bike Share Toronto

**Is there a bike available at St. George St. / Hoskin Ave.?**

<a href= 'https://bikesharetoronto.com'> <img src = 'bikesharescreenshot.png'> </a>



## Visualization using maps 

<mark> **Don't worry about understanding the code today. It's complicated, but we will learn it as the course goes on.** </mark>

In [1]:
import folium

# location is given as [latitude, longitude]
m = folium.Map(location=[43.664288100559816, -79.39800603825044], zoom_start=18)
m

## Bike Share data is ugly

But it can be made beautiful with a little bit of programming ...

## Step 1

- Read the data from <https://tor.publicbikesystem.net/ube/gbfs/v1/en/station_information> into Python.

```
{"last_updated":1637611674,"ttl":30,"data":{"stations":[{"station_id":"7000","name":"Fort York  Blvd / Capreol Ct","physical_configuration":"REGULAR","lat":43.639832,"lon":-79.395954,"altitude":0.0,"address":"Fort York  Blvd / Capreol Ct","capacity":35,"rental_methods":["KEY","TRANSITCARD","CREDITCARD","PHONE"],"groups":[],"obcn":"647-643-9607","nearby_distance":500.0},{"station_id":"7001","name":"Wellesley Station Green P","physical_configuration":"REGULAR","lat":43.66496415990742,"lon":-79.38355031526893,"altitude":0.0,"address":"Yonge / Wellesley","post_code":"M4Y 1G7","capacity":17,"rental_methods":["KEY","TRANSITCARD","CREDITCARD","PHONE"],"groups":[],"obcn":"416-617-9576","nearby_distance":500.0},

```

&#128576; Don't worry about understanding all the details for now ... 

In [3]:
import pandas as pd

station_url = 'https://tor.publicbikesystem.net/ube/gbfs/v1/en/station_information'
stationinfo = pd.read_json(station_url)

stationlist = stationinfo['data'].iloc[0]

The first station's data looks a bit better:

In [4]:
stationlist[0]

{'station_id': '7000',
 'name': 'Fort York  Blvd / Capreol Ct',
 'physical_configuration': 'REGULAR',
 'lat': 43.639832,
 'lon': -79.395954,
 'altitude': 0.0,
 'address': 'Fort York  Blvd / Capreol Ct',
 'capacity': 35,
 'is_charging_station': False,
 'rental_methods': ['KEY', 'TRANSITCARD', 'CREDITCARD', 'PHONE'],
 'groups': [],
 'obcn': '647-643-9607',
 'nearby_distance': 500.0,
 '_ride_code_support': True,
 'rental_uris': {}}
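
The nesting of the raw feed can be easier to see on a small offline sample. The sketch below parses a trimmed copy of the first station record with Python's standard `json` module. The `sample` string is an abbreviated stand-in for the live feed (an assumption for illustration, not the full response), and `feed["data"]["stations"]` plays the same role as `stationinfo['data'].iloc[0]` above.

```python
import json

# Trimmed stand-in for the live feed: same nesting, one station, fewer fields.
sample = '''
{"last_updated": 1637611674, "ttl": 30,
 "data": {"stations": [
   {"station_id": "7000", "name": "Fort York  Blvd / Capreol Ct",
    "lat": 43.639832, "lon": -79.395954, "capacity": 35}
 ]}}
'''

feed = json.loads(sample)                 # text -> nested dicts and lists
station_list = feed["data"]["stations"]   # the list of station records
first = station_list[0]                   # a single station, stored as a dict

print(first["name"], first["capacity"])
```

Each station is a dictionary, so a field such as the capacity is looked up by its key.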

## Step 2

- A bit of programming is needed to **transform the data** into a format that can be displayed on the map above.

In [5]:
# convert JSON format to GeoJSON format
def featuredict(lon, lat, id1, name, capacity):
    # GeoJSON lists coordinates as [longitude, latitude]
    feature = {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [lon, lat]
        },
        "properties": {
            "station_id": id1,
            "name": name,
            "capacity": capacity
        }
    }
    return feature


# build one GeoJSON feature for every station in the list
def create_station_list(station_records):
    features = []
    for station in station_records:
        features.append(
            featuredict(
                station['lon'],
                station['lat'],
                station['station_id'],
                station['name'],
                station['capacity']
            )
        )
    return features

stations = create_station_list(stationlist)
stations[0:2]

[{'type': 'Feature',
  'geometry': {'type': 'Point', 'coordinates': [-79.395954, 43.639832]},
  'properties': {'station_id': '7000',
   'name': 'Fort York  Blvd / Capreol Ct',
   'capacity': 35}},
 {'type': 'Feature',
  'geometry': {'type': 'Point',
   'coordinates': [-79.38355031526893, 43.66496415990742]},
  'properties': {'station_id': '7001',
   'name': 'Wellesley Station Green P',
   'capacity': 23}}]

## Step 2 (Cont'd)

- More data transformation ...

In [8]:
# wrap the list of station features in a GeoJSON FeatureCollection

def add_GeoJSON_formatting(features):
    collection = {
        "type": "FeatureCollection",
        "features": features
    }
    return collection


stations_geoj = add_GeoJSON_formatting(stations)
stations_geoj['features'][0]

{'type': 'Feature',
 'geometry': {'type': 'Point', 'coordinates': [-79.395954, 43.639832]},
 'properties': {'station_id': '7000',
  'name': 'Fort York  Blvd / Capreol Ct',
  'capacity': 35}}
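
GeoJSON lists coordinates as longitude first, then latitude, which is the reverse of the `[latitude, longitude]` order that `folium.Map` takes for `location`. Swapping the two is an easy mistake to make, so a quick sanity check can help. The sketch below is one possible check: `check_feature_collection` is a hypothetical helper written for this example, the bounding box around Toronto is a rough assumption, and the sample feature is copied from the output above.

```python
import json

# Rough bounding box around Toronto (an assumption for this sketch).
LON_RANGE = (-80.0, -79.0)
LAT_RANGE = (43.0, 44.0)


def check_feature_collection(geoj):
    """Return True if every feature is a Point inside the bounding box."""
    if geoj.get("type") != "FeatureCollection":
        return False
    for feature in geoj.get("features", []):
        geometry = feature.get("geometry", {})
        if geometry.get("type") != "Point":
            return False
        lon, lat = geometry["coordinates"]  # GeoJSON order: [lon, lat]
        if not (LON_RANGE[0] <= lon <= LON_RANGE[1]):
            return False
        if not (LAT_RANGE[0] <= lat <= LAT_RANGE[1]):
            return False
    return True


sample_geoj = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-79.395954, 43.639832]},
        "properties": {"station_id": "7000",
                       "name": "Fort York  Blvd / Capreol Ct",
                       "capacity": 35},
    }],
}

print(check_feature_collection(sample_geoj))

# The same structure can be written out as GeoJSON text.
geojson_text = json.dumps(sample_geoj)
```

With coordinates accidentally given as `[lat, lon]`, the longitude would fall outside the box and the check would return `False`.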

## A Data Science Application

**Question:** How many bikes are available?

**Data:** Bike Share Toronto data

**Communicate the results:** Visualize the data on a map showing where the bikes are located

**Next Steps:** Predict how many bikes are available at hourly intervals

In [13]:
folium.GeoJson(
    stations_geoj, 
    name = "Bikes", 
    tooltip = folium.features.GeoJsonTooltip(fields=['station_id','name','capacity'], localize=True)
).add_to(m)
m
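
The map above shows each station's capacity, not how many bikes are docked there right now. In the GBFS format that this feed follows, live counts are published in a separate station status feed that uses the same `station_id`. The sketch below joins the two on small samples: the station names and capacities are taken from the output earlier, the availability counts are invented for illustration, and the field name `num_bikes_available` comes from the GBFS specification rather than from this notebook.

```python
# Station information: one record per station (values from the output above).
info = [
    {"station_id": "7000", "name": "Fort York  Blvd / Capreol Ct", "capacity": 35},
    {"station_id": "7001", "name": "Wellesley Station Green P", "capacity": 23},
]

# Station status: invented counts, for illustration only.
status = [
    {"station_id": "7000", "num_bikes_available": 12},
    {"station_id": "7001", "num_bikes_available": 0},
]

# Index the status records by station_id so each lookup is a dict access.
bikes_by_id = {s["station_id"]: s["num_bikes_available"] for s in status}

# Attach the live count to each station record (None if no status is found).
availability = [
    {**station, "num_bikes_available": bikes_by_id.get(station["station_id"])}
    for station in info
]

for row in availability:
    print(row["name"], row["num_bikes_available"], "of", row["capacity"])
```

Collecting these counts at regular intervals would give the kind of data needed for the prediction step listed above.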

## How popular are tweets about UofT and Social Science?

We can search Twitter by going to <https://twitter.com>.

<a href='https://twitter.com/search?q=UofT%20%26%20Social%20Science&src=typed_query&f=live'> <img src='twitterscreenshot.png' width=500 height=600> </a>

Alternatively, we can write a program to extract the *data* (why do I use the word data?)

Ignore the actual code - you will learn about this later in the course.

In [17]:
%run twittertokens.py

In [20]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

## Search Twitter by programming

Let's search Twitter *programmatically* using the *query* `UofT & social science` for the 10 latest tweets that contain these terms.

In [22]:
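# alternative: page through the results with a Cursor
# tweets = tweepy.Cursor(api.search_tweets, q = 'Ukraine', lang = 'en').items(10)

# search for the 10 latest English tweets matching the query
tweets = api.search_tweets(q = 'UofT & social science', lang = 'en', count = 10)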

We can view 

- the text from the tweets
- the user who sent the tweet
- number of times the tweet was retweeted

In [None]:
retweetcounts_list = []
user_list = []

for tweet in tweets:
    print('The text is:', tweet.text)
    print('The user is:', tweet.user.name)
    print('The retweet count is:', tweet.retweet_count)
    retweetcounts_list.append(tweet.retweet_count)
    user_list.append(tweet.user.name)
    print('\n')

Let's have a look at the data

In [None]:
retweetcounts_list, user_list
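
If the same user appears more than once in `user_list`, it can help to total the retweets per user before looking for patterns. A minimal sketch follows, with made-up values standing in for the two lists; the user names and counts are hypothetical, not real search results.

```python
from collections import defaultdict

# Hypothetical stand-ins for user_list and retweetcounts_list.
sample_users = ["user_a", "user_b", "user_a", "user_c"]
sample_retweets = [3, 0, 5, 1]

# Sum the retweet counts for each user.
totals = defaultdict(int)
for user, count in zip(sample_users, sample_retweets):
    totals[user] += count

# Sort users from most to least retweeted.
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)
```
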

Plotting data usually helps us see patterns that we might not see otherwise.

In [None]:
import matplotlib.pyplot as plt

# one bar per tweet: user on the x-axis, retweet count on the y-axis
plt.bar(user_list, retweetcounts_list);
plt.xticks(rotation=75);

## How many people attended Federal Day Schools?


In [23]:
from IPython.display import IFrame
IFrame("https://indiandayschools.org/", 1200,1000)

![](FederalDaySchools/cover_expertreport.png)

![](FederalDaySchools/title_expertreport.png)

![](FederalDaySchools/estimates_expertreport.png)

![](FederalDaySchools/datapg1_expertreport.png)

![](FederalDaySchools/datapg2_expertreport.png)

![](FederalDaySchools/datapg3_expertreport.png)


![](FederalDaySchools/c-8149-00418.png)

![](FederalDaySchools/c-8171-00008.png)

## Questions 

- What is the provenance of the data used in the expert report?

- How does this data relate to the original documents?

- What were the record retention policies of Day Schools?

- If we don't know the data's provenance, then we cannot assess how reliable the estimated total number of people who attended Day Schools is.