## Motivation

- This part presents an initial outline of the tools that can help us create the Data Story, based on the work done so far.
- By keeping the outline in mind and updating it iteratively, we expect to reflect on how to present our results effectively at every stage of the data analysis pipeline, while staying open to new insights throughout the project.
- Conversely, being conscious of the data story can make our data analysis more structured: we develop the story we want to tell and look for the data that can support it.

## Structure
- The initial outline focuses on how we could communicate the results for each research question, and proposes a first order in which to present them.
- Further iterations will focus increasingly on a coherent story that follows a well-thought-through order to improve readability.

### Q1/ What does the typical patent-holder look like today (Corporation, Universities, Governments, Individuals), and how has that evolved throughout time / geographies?
_Q1.1/ Is a migration of innovators through time visible in the data, e.g. a convergence towards certain innovation centers?  
Q1.2 / How has the number of assignees and inventors evolved through time for different patent types? Are there significant differences in these numbers between different geographies?_  

Get all the patent data for the year 2016.

In [1]:
import pandas as pd
import numpy as np
import os
import json
from pipeline import patentsviewAPI, data_clean

In [2]:
def get_full_year_data(year):
    filenames = [year + q for q in ['q1','q2','q3','q4']]
    date_from = [year + '-' + date for date in ['01-01','04-01','07-01','10-01']]
    date_to = [year + '-' + date for date in ['03-31','06-30','09-30','12-31']]
    filepath = 'data'

    data = {}
    
    for i,filename in enumerate(filenames):
        print(filepath,filename)
        if os.path.isfile(os.path.join(filepath,filename + '.json')):
            datafile = os.path.join(filepath,filename + '.json')
            data[filename] = data_clean(datafile)
            print('loaded from file')
        else:
            datafile = patentsviewAPI(filename, app_date_from = date_from[i], app_date_to = date_to[i])
            data[filename] = data_clean(datafile)
            
    # Stack the quarterly tables; pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    quarters = list(data.values())
    assignees = pd.concat([q['assignees'] for q in quarters]).drop_duplicates()
    inventors = pd.concat([q['inventors'] for q in quarters]).drop_duplicates()
    patents = pd.concat([q['patents'] for q in quarters])
    # NOTE: citations still come from the first quarter only;
    # merging the counts across quarters is not implemented yet
    citations = quarters[0]['citations'].copy()
        
    return assignees, inventors, patents, citations
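
The function above only keeps the citation counts of the first quarter. A minimal sketch of how the counts could be summed across quarters, assuming each quarter's citations can be reduced to a mapping from cited patent number to count (the helper name `merge_citation_counts` and the toy values are hypothetical, not part of `pipeline`):

```python
from collections import Counter


def merge_citation_counts(quarterly_counts):
    """Sum per-patent citation counts over several quarters."""
    total = Counter()
    for counts in quarterly_counts:
        # Counter.update adds the counts instead of overwriting them
        total.update(counts)
    return total


# Hypothetical toy data: cited patent id -> number of citations in that quarter
q1 = {'P1': 3, 'P2': 2}
q2 = {'P1': 4, 'P3': 1}
merged = merge_citation_counts([q1, q2])
print(dict(merged))
```

With pandas tables, the same idea would be to concatenate the quarterly tables and sum the counts per index value.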

Get all 2016 data

In [3]:
assignees, inventors, patents, citations = get_full_year_data('2016')

data 2016q1
86
fetching first page
fetching page 2
fetching page 3
fetching page 4
saving data
data 2016q2
86
fetching first page
fetching page 2
fetching page 3
fetching page 4
saving data
data 2016q3
86
fetching first page
fetching page 2
fetching page 3
saving data
data 2016q4
86
fetching first page
fetching page 2
fetching page 3
saving data


Get the index of every assignee whose name contains 'samsung'

In [4]:
samsung_assignee_numbers = []
for row, val in enumerate(assignees.values):
    if val[1]:
        if 'samsung' in val[1].lower():
            samsung_assignee_numbers.append(assignees.index.values[row])

Get all Samsung patent numbers

In [5]:
samsung_patents_2016 = []
for num in samsung_assignee_numbers:
    for i, assignee in enumerate(list(patents.assignees.values)):
        if num in assignee:
            samsung_patents_2016.append(patents.index.values[i])
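
The nested loop above scans every patent once per Samsung assignee, and a patent with several matching assignees is appended several times. A possible alternative, sketched here on plain dictionaries rather than on the actual `patents` table, is to build a set of the wanted assignee ids and test each patent once (the function name and the toy ids are assumptions for illustration):

```python
def patents_for_assignees(patent_assignees, assignee_ids):
    """Return the patents that list at least one of the given assignees.

    patent_assignees: dict mapping patent number -> list of assignee ids
    assignee_ids: iterable of assignee ids to look for
    """
    wanted = set(assignee_ids)
    return [number for number, ids in patent_assignees.items()
            if wanted.intersection(ids)]


# Hypothetical toy data
toy_patents = {'p1': ['a1'], 'p2': [], 'p3': ['a2', 'a3'], 'p4': ['a9']}
matches = patents_for_assignees(toy_patents, ['a1', 'a2', 'a3'])
print(matches)
```

Each patent appears at most once in the result, which also affects the count reported later by `len(...)`.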

Get all patents cited by Samsung patents

In [6]:
def get_patent_citations(patent_numbers, filename):
    # NOTE: `filename` is not used yet; see the comment about saving the data below
    frames = []
    # query the API in batches of 100 patent numbers
    for i, start in enumerate(range(0, len(patent_numbers), 100)):
        datafile = patentsviewAPI('temp' + str(i) + '.json', patent_number = patent_numbers[start:start+100])
        clean_data = data_clean(datafile)
        
        # here you can extract all the assignees, inventors and patents data
        # for now I just extract the cited patent, to check
        frames.append(clean_data['citations'])

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    data = pd.concat(frames)

    # would be a good idea to add a line to save the data, to not have to query it again if we need it later
    
    return data
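
The comment about saving the data could be addressed with a small cache on disk. The sketch below assumes the result can be serialized as JSON (a DataFrame would first need converting, for instance to a dictionary); `cached` and `fake_query` are hypothetical names, and `fake_query` merely stands in for the API call:

```python
import json
import os
import tempfile


def cached(path, compute):
    """Return the JSON content of `path`, computing and saving it on first use."""
    if os.path.isfile(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    with open(path, 'w') as f:
        json.dump(result, f)
    return result


calls = []


def fake_query():
    # Stand-in for an expensive API query; records how often it runs
    calls.append(1)
    return {'P1': 7, 'P2': 4}


path = os.path.join(tempfile.mkdtemp(), 'cited_patents.json')
first = cached(path, fake_query)   # runs the query and writes the file
second = cached(path, fake_query)  # reads the file, no second query
print(first == second, len(calls))
```

This mirrors the check already done in `get_full_year_data`, which loads a quarter from file when it exists.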

In [7]:
samsung_cited_patents_2016 = get_patent_citations(samsung_patents_2016,'samsung_cited_patents_2016.json')

1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1119
fetching first page
saving data
1

In [8]:
samsung_cited_patents_2016

Unnamed: 0,count
8559235,7
7679133,7
8553466,7
8654587,7
7940350,4
7002182,4
7084420,4
7087932,4
7154124,4
7208725,4


In [9]:
len(samsung_patents_2016)

7302

In [10]:
len(samsung_cited_patents_2016)

50628

### Approach Q1
- Question 1 is a natural next step after, e.g., a short introduction to patents themselves: it gives the reader insight into the main sources of patents and, potentially, a historical overview of the same metric.
- Q1.1 focuses on how the data evolves through time from a geographic standpoint, so maps are a natural way to visualize the migration. The interactive graphs discussed under Q2 below are one alternative here as well.
    - One approach is to use several maps for successive time periods, so that the reader can see whether patterns of innovation centers emerge.
    - Another approach is an interactive chart that lets the reader toggle between time periods, enabling comparison on a single map.
- Q1.2 can potentially provide a good segue into what we brought up in the Project's abstract: _"While innovation is often portrayed as the product of either one charismatic leader - or a ragtag team of geniuses - in reality we suspect that innovations, however important, happen in small steps supported by large networks of people."_
    - This will also be addressed in more detail in research question 2.
    - Bar plots or line plots could help answer the first part of the question, and the same tools could be a first approach to the second part once the data is segmented by geography. Maps could also be explored here.
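
As a rough illustration of the kind of series a bar plot or line plot for Q1.2 could show, the sketch below computes the mean number of inventors per patent, by patent type and month. It assumes rows shaped like the `patents` table (a `date` string, a `type` and a list of `inventors`); the function name and the toy rows are hypothetical:

```python
from collections import defaultdict


def mean_team_size_by_month(rows):
    """Average number of inventors per patent, keyed by (type, 'YYYY-MM')."""
    sizes = defaultdict(list)
    for row in rows:
        month = row['date'][:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        sizes[(row['type'], month)].append(len(row['inventors']))
    return {key: sum(values) / len(values) for key, values in sizes.items()}


# Hypothetical toy rows
rows = [
    {'date': '2016-01-14', 'type': 'utility', 'inventors': ['i1', 'i2', 'i3', 'i4']},
    {'date': '2016-01-27', 'type': 'utility', 'inventors': ['i5']},
    {'date': '2016-02-03', 'type': 'design', 'inventors': ['i6', 'i7']},
]
result = mean_team_size_by_month(rows)
print(result)
```

The same grouping could be applied to assignees, or extended with a geography key once locations are attached to each patent.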

In [11]:
import folium
import numpy as np
import pandas as pd
from collections import namedtuple

In [12]:
def get_arrows(locations, color='blue', size=6, n_arrows=3):
    
    '''
    Get a list of correctly placed and rotated 
    arrows/markers to be plotted
    
    Parameters
    locations : list of lists of lat lons that represent the 
                start and end of the line. 
                eg [[41.1132, -96.1993],[41.3810, -95.8021]]
    color : default is 'blue'
    size : default is 6
    n_arrows : number of arrows to create.  default is 3

    Return
    list of arrows/markers
    '''
    
    Point = namedtuple('Point', field_names=['lat', 'lon'])
    
    # creating points from our Point namedtuple
    p1 = Point(locations[0][0], locations[0][1])
    p2 = Point(locations[1][0], locations[1][1])
    
    # getting the rotation needed for our marker.  
    # Subtracting 90 to account for the marker's orientation
    # of due East(get_bearing returns North)
    rotation = get_bearing(p1, p2) - 90
    
    # get an evenly space list of lats and lons for our arrows
    # note that I'm discarding the first and last for aesthetics
    # as I'm using markers to denote the start and end
    arrow_lats = np.linspace(p1.lat, p2.lat, n_arrows + 2)[1:n_arrows+1]
    arrow_lons = np.linspace(p1.lon, p2.lon, n_arrows + 2)[1:n_arrows+1]
    
    arrows = []
    
    #creating each "arrow" and appending them to our arrows list
    for points in zip(arrow_lats, arrow_lons):
        # the markers are only created here; the caller adds them to a map
        arrows.append(folium.RegularPolygonMarker(location=points, 
                      fill_color=color, number_of_sides=3, 
                      radius=size, rotation=rotation))
    return arrows

In [13]:
def get_bearing(p1, p2):
    
    '''
    Returns compass bearing from p1 to p2
    
    Parameters
    p1 : namedtuple with lat lon
    p2 : namedtuple with lat lon
    
    Return
    compass bearing of type float
    
    Notes
    Based on https://gist.github.com/jeromer/2005586
    '''
    
    long_diff = np.radians(p2.lon - p1.lon)
    
    lat1 = np.radians(p1.lat)
    lat2 = np.radians(p2.lat)
    
    x = np.sin(long_diff) * np.cos(lat2)
    y = (np.cos(lat1) * np.sin(lat2) 
        - (np.sin(lat1) * np.cos(lat2) 
        * np.cos(long_diff)))

    bearing = np.degrees(np.arctan2(x, y))
    
    # adjusting for compass bearing
    if bearing < 0:
        return bearing + 360
    return bearing
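
A quick sanity check of the bearing formula, written with the standard `math` module so that it runs on its own. This is a restatement of the same formula on plain `(lat, lon)` tuples, not the notebook's function; the name `bearing` is an assumption for illustration:

```python
import math


def bearing(p1, p2):
    """Compass bearing in degrees from p1 to p2, both given as (lat, lon)."""
    lat1, lat2 = math.radians(p1[0]), math.radians(p2[0])
    long_diff = math.radians(p2[1] - p1[1])
    x = math.sin(long_diff) * math.cos(lat2)
    y = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(long_diff))
    # atan2 returns a value in (-180, 180]; the modulo maps it to [0, 360)
    return math.degrees(math.atan2(x, y)) % 360


north = bearing((0, 0), (1, 0))   # moving up in latitude
east = bearing((0, 0), (0, 1))    # moving right in longitude
west = bearing((0, 0), (0, -1))   # moving left in longitude
print(north, east, west)
```

Moving north, east and west from the origin should give bearings of roughly 0, 90 and 270 degrees, which is why `get_arrows` subtracts 90 to align a marker that points east by default.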

In [14]:
earth_center = (44.63, 28.77)
m = folium.Map(earth_center ,tiles='cartodbpositron', zoom_start=2)
# Plot the first len(inventors)/1000 inventors;
# .iloc selects by position because the index holds inventor ids
for i in range(round(len(inventors['location'])/1000)):
    start = inventors['location'].iloc[i]
    end = inventors['last_location'].iloc[i]
    folium.Marker(start, popup=str(i), icon=folium.Icon(color='red')).add_to(m)
    folium.Marker(end, popup=str(i), icon=folium.Icon(color='green')).add_to(m)
    folium.PolyLine([start, end], color='blue').add_to(m)
    arrows = get_arrows([start, end], n_arrows=3)
    for arrow in arrows:
        arrow.add_to(m)
m


In [15]:
inventors.index

Index(['1383149', '1798151', '2171910', '2250909', '2641780', '2911653',
       '1333497', '2565480', '2945844', '2737298',
       ...
       '3818453', '3815373', '3817498', '3819120', '3818697', '3819132',
       '2815828', '3818746', '3817057', '3815935'],
      dtype='object', length=29043)

In [16]:
patents.head()

Unnamed: 0,assignees,date,inventors,type
9319491,[120684],2016-01-14,"[1383149, 1798151, 2171910, 2250909]",utility
9323199,[348786],2016-01-27,[2641780],utility
9326929,[],2016-01-06,[2911653],utility
9329018,[73247],2016-01-25,[1333497],utility
9329634,[121339],2016-01-22,"[745488, 2565480, 2945844]",utility


In [17]:
# Create the dictionary whose keys are patent types
patents_by_type = {}
for t in patents.type.unique():
    patents_by_type[t] = patents[patents.type==t].drop("type",axis=1)
patents_by_type

{'utility':         assignees        date  \
 9319491  [120684]  2016-01-14   
 9323199  [348786]  2016-01-27   
 9326929        []  2016-01-06   
 9329018   [73247]  2016-01-25   
 9329634  [121339]  2016-01-22   
 9332390  [334442]  2016-01-17   
 9334318  [341867]  2016-01-11   
 9335411  [307296]  2016-01-25   
 9336070  [154765]  2016-01-08   
 9336152   [49845]  2016-01-20   
 9336206   [80271]  2016-01-21   
 9336662        []  2016-01-26   
 9337046  [126788]  2016-01-21   
 9338512  [223060]  2016-01-04   
 9339756        []  2016-01-08   
 9340283  [256950]  2016-01-06   
 9342492  [195195]  2016-01-06   
 9342644  [197728]  2016-01-11   
 9343091  [190368]  2016-02-08   
 9343127   [68101]  2016-01-05   
 9344544        []  2016-01-31   
 9344892  [114355]  2016-01-19   
 9345338  [225586]  2016-01-11   
 9345414   [95170]  2016-01-15   
 9345701  [332333]  2016-02-03   
 9345784  [341867]  2016-01-25   
 9347848  [120576]  2016-02-11   
 9349307   [54350]  2016-01-11   
 93

In [18]:
# To access dates at which a specific inventor has proposed patents
inventor_to_dates = {}
for t in patents.type.unique():
    inventor_to_dates[t]={}
    for index, row in patents_by_type[t].iterrows():
        for inventor in row.inventors:
            inventor_to_dates[t].setdefault(inventor,[]).append(row.date)
inventor_to_dates

{'utility': {'1383149': ['2016-01-14',
   '2016-02-08',
   '2016-02-18',
   '2016-04-18',
   '2016-06-30',
   '2016-10-12'],
  '1798151': ['2016-01-14',
   '2016-02-11',
   '2016-04-18',
   '2016-06-23',
   '2016-06-30',
   '2016-08-13',
   '2016-09-19'],
  '2171910': ['2016-01-14',
   '2016-04-18',
   '2016-06-23',
   '2016-06-30',
   '2016-07-28',
   '2016-08-13',
   '2016-09-19'],
  '2250909': ['2016-01-14', '2016-04-18', '2016-06-30'],
  '2641780': ['2016-01-27'],
  '2911653': ['2016-01-06'],
  '1333497': ['2016-01-25', '2016-10-20'],
  '745488': ['2016-01-22',
   '2016-03-23',
   '2016-01-28',
   '2016-02-08',
   '2016-02-08',
   '2016-08-03'],
  '2565480': ['2016-01-22', '2016-03-23', '2016-01-28', '2016-08-03'],
  '2945844': ['2016-01-22', '2016-03-23', '2016-08-03'],
  '2737298': ['2016-01-17', '2016-11-21'],
  '2755292': ['2016-01-17'],
  '655437': ['2016-01-11',
   '2016-01-25',
   '2016-01-25',
   '2016-01-25',
   '2016-02-17',
   '2016-02-26',
   '2016-04-21',
   '2016-04-2

In [19]:
from datetime import datetime

In [35]:
# To access inventors that have published at a specific date
date_to_inventors = {}
for t in patents.type.unique():
    date_to_inventors[t]={}
    for index, row in patents_by_type[t].iterrows():
        date = datetime.strptime(row.date,'%Y-%m-%d')
        date_to_inventors[t].setdefault(date,[]).extend(row.inventors)
date_to_inventors

{'utility': {datetime.datetime(2016, 1, 14, 0, 0): ['1383149',
   '1798151',
   '2171910',
   '2250909',
   '80679',
   '1482185',
   '1073625',
   '1430716',
   '2412522',
   '1867977',
   '2090806',
   '10411',
   '71763',
   '1092074',
   '1233565',
   '1255321',
   '2310431',
   '3233034',
   '1284416',
   '1319732',
   '1713860',
   '2062778',
   '2609052',
   '3239669',
   '1944944',
   '549177',
   '1795018',
   '3122372',
   '868213',
   '1214519',
   '1401163',
   '3252987',
   '3252988',
   '3318612',
   '1312113',
   '2274229',
   '3321232',
   '51447',
   '2476625',
   '3130751',
   '3324722',
   '3327116',
   '2799408',
   '3152122',
   '2369975',
   '3330391',
   '3331529',
   '775509',
   '2654668',
   '1746812',
   '1965572',
   '2468710',
   '3152736',
   '3333327',
   '3333328',
   '473446',
   '3021833',
   '2794600',
   '3333353',
   '3333354',
   '2382045',
   '3102773',
   '74029',
   '1519506',
   '401635',
   '844873',
   '992699',
   '2328694',
   '3236553',
  

In [24]:
for k,v in date_to_inventors["design"].items():
    if (k.month == 3):
        print(k)

2016-03-29 00:00:00
2016-03-31 00:00:00
2016-03-12 00:00:00
2016-03-14 00:00:00
2016-03-09 00:00:00
2016-03-17 00:00:00
2016-03-01 00:00:00
2016-03-11 00:00:00
2016-03-28 00:00:00
2016-03-06 00:00:00
2016-03-25 00:00:00
2016-03-23 00:00:00
2016-03-02 00:00:00
2016-03-10 00:00:00
2016-03-15 00:00:00
2016-03-21 00:00:00
2016-03-24 00:00:00
2016-03-22 00:00:00
2016-03-30 00:00:00
2016-03-04 00:00:00
2016-03-07 00:00:00
2016-03-26 00:00:00
2016-03-03 00:00:00
2016-03-18 00:00:00
2016-03-08 00:00:00
2016-03-16 00:00:00
2016-03-19 00:00:00
2016-03-20 00:00:00
2016-03-05 00:00:00
2016-03-13 00:00:00


In [34]:
for k,v in date_to_inventors["design"].items():
    print(v)

['3785089', '3777792', '3785089', '1025730', '2386025', '1728594', '2380400', '2394725', '3792530', '2847958', '2847959', '2311541', '3767342', '3783025', '3794878', '3794879', '3794880', '3794881', '3779475', '3782744', '3796195', '3796295', '372147', '3796720', '3796970', '2776161', '2491975', '2898279', '3466365', '3683713', '3797350', '3797351', '3797898', '3796195', '3798578', '3798651', '3798652', '3732419', '3792268', '3799115', '3799116', '2041443', '3533548', '3533549', '3799115', '3799116', '3778743', '2632580', '2632580', '3800043', '2041443', '3522097', '3533548', '3533549', '3546570', '3800673', '1405882', '3800823', '3800824', '3800825', '3800826', '1907881', '1746187', '811242', '811242', '1683948', '2517380', '3726082', '3801956', '3801957', '3801963', '3005537', '3802051', '3796241', '3802172', '3802173', '3802174', '2041443', '3522097', '3533548', '3533549', '1746187', '3529603', '3352344', '3802759', '3802760', '3352344', '3802759', '3802760', '3352344', '3802759', '

In [62]:
id_list = []

for k,v in date_to_inventors["utility"].items():
    if (k.month == 6):
        id_list.extend([k.day,v])

#id_unique = set(id_list)
#print(id_unique)
print(id_list)

<class 'list'>


In [91]:
for index, row in inventors.iterrows():
    # the index holds inventor ids as strings, so compare against a string
    if index == '1383149':
        print(index)

In [123]:

inventor_id = 1798151
inventors[inventors.index.values==str(inventor_id)]['last_location']

1798151    (29.1416, 119.789)
1798151    (29.1416, 119.789)
1798151    (29.1416, 119.789)
Name: last_location, dtype: object

In [117]:
# Map each day of June to the [latitude, longitude] pairs of that day's inventors
location_list = {}
for k,v in date_to_inventors["utility"].items():
    if (k.month == 6):
        print(k.day)
        print(v)
        for inventor_id in v:
            # an inventor can appear on several rows; use the first (lat, lon) tuple
            matches = inventors.loc[inventors.index == str(inventor_id), 'last_location']
            if len(matches) == 0:
                continue
            latitude, longitude = matches.iloc[0]
            location_list.setdefault(k.day, []).append([latitude, longitude])

1
['1648946', '2219406', '3155909', '3182724', '3192097', '1391997', '2560380', '3294620', '3294621', '3294622', '1474925', '1612400', '1612315', '1894367', '2929185', '2929186', '3345088', '3242281', '1125581', '2550564', '2644188', '2978061', '3290026', '3290027', '1134402', '3081443', '3081444', '3081445', '1632450', '2339202', '3213273', '3294560', '1693484', '2077437', '3271220', '3292395', '1582740', '1584878', '1619558', '1623811', '3158804', '3365052', '3365053', '3366528', '3366529', '2068578', '1622176', '2492924', '979424', '3199993', '3246163', '1151561', '2669806', '2739103', '3372730', '3372731', '3372732', '167698', '1258177', '1645371', '1839066', '2168206', '3294684', '3379467', '157899', '824129', '2324535', '2110471', '2948720', '2965545', '2965546', '965612', '743555', '857653', '1377910', '2965545', '2965546', '2935928', '3283592', '3283593', '2935928', '3283592', '3283593', '3071430', '3071431', '3071432', '3199549', '2764267', '1608713', '2087378', '3233141', '12

In [None]:
from folium import plugins

map_hooray = folium.Map(earth_center,
                    zoom_start = 1) 

# Build one frame of [lat, lon] points per month, using the utility patents
heat_data = [[] for _ in range(12)]
for date, ids in date_to_inventors['utility'].items():
    for inventor_id in ids:
        # an inventor can appear on several rows; use the first (lat, lon) tuple
        matches = inventors.loc[inventors.index == str(inventor_id), 'last_location'].dropna()
        if len(matches) > 0:
            latitude, longitude = matches.iloc[0]
            # Ensure you're handing it floats
            heat_data[date.month - 1].append([float(latitude), float(longitude)])

# Plot it on the map
hm = plugins.HeatMapWithTime(heat_data,auto_play=True,max_opacity=0.8)
hm.add_to(map_hooray)
# Display the map
map_hooray

### Q2/ How can we best identify and visualize different geographical innovation networks? Can we estimate the number of people in such networks?
_Q2.1/ If we then take a few examples of different types of companies and look at the network of patents supporting their own patents, will these networks match up with the former innovation networks, or will they be more self-contained? In the latter case, can we estimate the number of people that make up these networks? Are these innovation networks concentrated around specific areas, or are they spread out?  
Q2.2/ Do similar companies use the same knowledge bases to innovate? For example, if we look at different social networking companies, will the networks supporting their patents be distinct? Will a given company's patents mostly cite its own previous patents, or will they tap outside innovation networks? On what scale?  
Q2.3/ What about if we look at university/academic knowledge bases and compare them with those of the companies analyzed above?   
Q2.4/ What about governmental or non-governmental organizations, or international agencies?    
Q2.5/ How have the innovation networks identified above evolved through time?_ 

### Approach Q2
- For the geographically focused research question, a natural approach can be to explore the data using maps to visualize e.g. the innovation networks.  
- In HW2 we worked with **Folium**, which can be one interesting tool here, as we have access to clean longitude- and latitude data (see preprocessing.ipynb). Folium has e.g. cluster-functionality, which could be a way to quantify the magnitude of the networks in different geographic zones. 
    - Folium also gives us the option to add interactivity.
- To represent networks, an alternative could be to use a graph-based approach, where relationships between entities within a network could be represented by nodes and edges. The Python library **networkx**, which enables the creation of Network Graphs, is one option. 
    - Network representations could also be relevant to answer question 2.2, when exploring the citation of patents from similar companies. 
- Throughout the data story, the usage of **interactive graphs** could be particularly efficient in making the plots self-sufficient and to invite exploration by the reader. Adding the element of time can enable comparing the development of innovation networks, and by hovering over parts of the map the reader can obtain more detailed information. 
    - The Python library **HoloViews** can, alongside networkx, enable the creation of interactive network graphs. 
    - Another interactive visualization library to explore is **Bokeh**.
- Questions Q2.1-2 could provide tangible examples of how prevalent (or not) patent networks are for well-known companies.
- Following these results with those of questions Q2.3-4 could give a good overview of the similarities and differences between the three major categories of organizations (academia, non-governmental and governmental organizations). This should interest readers from various fields, and hopefully also our peers (students) who are interested in research and innovation and are contemplating possible career tracks.
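
For Q2.2, one simple quantity to start from is the share of a company's citations that point back to its own patents. The sketch below works on plain Python structures and does not depend on networkx; the function name, the `(citing, cited)` pair format and the toy data are assumptions for illustration:

```python
def self_citation_share(citations, owner, company):
    """Fraction of a company's outgoing citations that target its own patents.

    citations: iterable of (citing_patent, cited_patent) pairs
    owner: dict mapping patent number -> company name
    """
    own = total = 0
    for citing, cited in citations:
        if owner.get(citing) != company:
            continue  # only count citations made by the company's patents
        total += 1
        if owner.get(cited) == company:
            own += 1
    return own / total if total else 0.0


# Hypothetical toy data: two companies, four patents
owner = {'p1': 'A', 'p2': 'A', 'p3': 'B', 'p4': 'A'}
citations = [('p1', 'p2'), ('p1', 'p3'), ('p4', 'p2'), ('p3', 'p1')]
share_a = self_citation_share(citations, owner, 'A')
share_b = self_citation_share(citations, owner, 'B')
print(share_a, share_b)
```

The same pairs could serve as the edge list of a directed graph if a network library is used for the visualization.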

In summary, the visualization techniques we choose, and the amount of interactivity we add (if any), will have a big impact on the quality of the data story. The data story is eventually intended to be published on a website generated by Jekyll, a static site generator.