## Motivation

- This part presents an initial outline of the tools that can help us create the Data Story, based on the work done so far.
- By iteratively updating the outline and keeping it in mind, we expect to reflect, throughout the data analysis pipeline, on how to present our results effectively. This approach lets us stay open to new insights as the project progresses.
- At the same time, being conscious of the data story can make our analysis more structured: we develop the story we want to tell and look for the data that can support it.

## Structure
- The initial outline focuses on how we could communicate the results for each research question, and proposes a first order in which to present them.
- Further iterations will focus increasingly on a coherent story that follows a well-thought-out order to improve readability.

### Q1/ What does the typical patent-holder look like today (corporations, universities, governments, individuals), and how has that evolved over time and across geographies?
_Q1.1/ Is a migration of innovators through time visible in the data, e.g. a convergence towards certain innovation centers?  
Q1.2 / How has the number of assignees and inventors evolved through time for different patent types? Are there significant differences in these numbers between different geographies?_  

Load all patent data for 2016.

In [1]:
import pandas as pd
import numpy as np
import os
import json
from pipeline import patentsviewAPI, data_clean

In [2]:
def get_full_year_data(year):
    """Load the four quarterly datasets for `year` (a string such as '2016') and combine them."""
    filenames = [year + q for q in ['q1','q2','q3','q4']]
    date_from = [year + '-' + date for date in ['01-01','04-01','07-01','10-01']]
    date_to = [year + '-' + date for date in ['03-31','06-30','09-30','12-31']]
    filepath = 'data'

    data = {}

    for i,filename in enumerate(filenames):
        print(filepath,filename)
        datafile = os.path.join(filepath,filename + '.json')
        if os.path.isfile(datafile):
            # Reuse the quarter that was already downloaded.
            data[filename] = data_clean(datafile)
            print('loaded from file')
        else:
            # Otherwise query the API for this quarter's application dates.
            datafile = patentsviewAPI(filename, filepath = filepath, app_date_from = date_from[i], app_date_to = date_to[i])
            data[filename] = data_clean(datafile)

    # Stack the quarterly tables; pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    quarters = list(data.values())
    assignees = pd.concat([q['assignees'] for q in quarters]).drop_duplicates()
    inventors = pd.concat([q['inventors'] for q in quarters]).drop_duplicates()
    patents = pd.concat([q['patents'] for q in quarters])
    # NOTE: citations are taken from the first quarter only; merging the other quarters is still to do.
    citations = quarters[0]['citations'].copy()

    return assignees, inventors, patents, citations

Get all 2016 data

In [3]:
assignees, inventors, patents, citations = get_full_year_data('2016')

data 2016q1
loaded from file
data 2016q2
loaded from file
data 2016q3
loaded from file
data 2016q4
loaded from file
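
The quarter boundaries that `get_full_year_data` hard-codes as strings could also be derived from the calendar. The following is only a sketch using the standard library; the helper name `quarter_bounds` is hypothetical and not part of the `pipeline` module.

```python
import calendar
from datetime import date


def quarter_bounds(year):
    """Return (label, first_day, last_day) for each quarter of `year` as ISO date strings."""
    bounds = []
    for q in range(4):
        first_month = 3 * q + 1
        last_month = first_month + 2
        # monthrange returns (weekday of first day, number of days in month)
        last_day = calendar.monthrange(year, last_month)[1]
        bounds.append((
            'q' + str(q + 1),
            date(year, first_month, 1).isoformat(),
            date(year, last_month, last_day).isoformat(),
        ))
    return bounds


bounds_2016 = quarter_bounds(2016)
for label, start, end in bounds_2016:
    print(label, start, end)
```

The resulting start and end dates could then be passed as `app_date_from` and `app_date_to` in place of the hand-written lists.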


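Get all assignees whose name contains 'samsung'

In [4]:
# The assignee name is taken from the second column, as in the original positional lookup.
names = assignees.iloc[:, 1]
# Case-insensitive substring match; missing names count as non-matches.
is_samsung = names.str.lower().str.contains('samsung', regex=False, na=False)
samsung_assignee_numbers = list(assignees.index[is_samsung.to_numpy()])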

Get the patent numbers of all Samsung patents

In [5]:
samsung_patents_2016 = []
for num in samsung_assignee_numbers:
    for i, assignee in enumerate(list(patents.assignees.values)):
        if num in assignee:
            samsung_patents_2016.append(patents.index.values[i])
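
The nested loop above scans every patent once per Samsung assignee ID. A single-pass alternative is sketched below, assuming (as the loop suggests) that `patents.assignees` holds a list of assignee IDs per patent. The helper name `patents_for_assignees` and the toy data are hypothetical. Unlike the loop, this version lists a patent only once even when several matching assignees appear on it, so the resulting count can differ.

```python
import pandas as pd


def patents_for_assignees(patents, assignee_ids):
    """Return the index labels of patents that list at least one of `assignee_ids`."""
    wanted = set(assignee_ids)
    # A patent matches when its assignee list shares at least one ID with `wanted`.
    mask = patents['assignees'].apply(lambda ids: not wanted.isdisjoint(ids))
    return patents.index[mask.to_numpy()].tolist()


# Toy data standing in for the real `patents` table.
toy_patents = pd.DataFrame(
    {'assignees': [['a1'], ['a2', 'a3'], ['a1', 'a3'], []]},
    index=['p1', 'p2', 'p3', 'p4'],
)
matches = patents_for_assignees(toy_patents, ['a1', 'a3'])
print(matches)
```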

Get all patents cited by the Samsung patents

In [6]:
def get_patent_citations(patent_numbers, filename):
    # NOTE: `filename` is not used yet; see the comment about saving the data below.
    chunk_size = 100  # query the API in batches of 100 patent numbers
    chunks = []
    for i, start in enumerate(range(0, len(patent_numbers), chunk_size)):
        datafile = patentsviewAPI('temp' + str(i) + '.json', 'data', patent_number = patent_numbers[start:start + chunk_size])
        clean_data = data_clean(datafile)

        # here you can extract all the assignees, inventors and patents data
        # for now I just extract the cited patent, to check
        chunks.append(clean_data['citations'])

    # would be a good idea to add a line to save the data, to not have to query it again if we need it later

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0;
    # looping over ranges also handles inputs shorter than one batch.
    return pd.concat(chunks)
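
Following up on the comment about saving the data, one way to avoid repeated queries is to cache the combined table on disk. The sketch below assumes the citations are a pandas DataFrame; `load_or_fetch` and `fake_fetch` are hypothetical names, not part of the `pipeline` module, and CSV is just one possible storage format.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(path, fetch):
    """Return the table cached at `path`, calling `fetch()` and saving its result only if needed."""
    if os.path.isfile(path):
        # index_col=0 restores the index (note: its dtype is inferred again on reading)
        return pd.read_csv(path, index_col=0)
    data = fetch()
    data.to_csv(path)
    return data


calls = []

def fake_fetch():
    # Stand-in for the expensive API query; records how often it runs.
    calls.append(1)
    return pd.DataFrame({'count': [7, 4]}, index=[8559235, 7940350])


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'samsung_cited_patents_2016.csv')
    first = load_or_fetch(cache_path, fake_fetch)   # runs the query and writes the file
    second = load_or_fetch(cache_path, fake_fetch)  # reads the file instead

print(len(calls))
```

In the notebook, `fetch` could be a small function that wraps the call to `get_patent_citations`.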

In [7]:
samsung_cited_patents_2016 = get_patent_citations(samsung_patents_2016,'samsung_cited_patents_2016.json')

1119
fetching first page
saving data
1

In [8]:
samsung_cited_patents_2016

| patent number | count |
|---|---|
| 8559235 | 7 |
| 7679133 | 7 |
| 8553466 | 7 |
| 8654587 | 7 |
| 7940350 | 4 |
| 7002182 | 4 |
| 7084420 | 4 |
| 7087932 | 4 |
| 7154124 | 4 |
| 7208725 | 4 |


In [10]:
len(samsung_patents_2016)

7302

In [11]:
len(samsung_cited_patents_2016)

50628
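
Because `get_patent_citations` stacks the per-batch citation tables, a patent cited in several batches may occupy several rows, so the length above may overstate the number of distinct cited patents. The sketch below shows one way to collapse such rows, assuming the table has a `count` column indexed by patent number as in the output shown earlier. The labels and values are made up.

```python
import pandas as pd

# Two toy batches; patent 'A' is cited in both.
batch_1 = pd.DataFrame({'count': [3, 1]}, index=['A', 'B'])
batch_2 = pd.DataFrame({'count': [4, 2]}, index=['A', 'C'])

stacked = pd.concat([batch_1, batch_2])

# Sum the counts of rows that share the same index label, most cited first.
totals = stacked.groupby(level=0)['count'].sum().sort_values(ascending=False)
print(len(stacked), len(totals))
print(totals)
```

The same aggregation could be a starting point for combining the quarterly citation tables in `get_full_year_data`.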

### Approach Q1
- Question 1 can be a natural next step after, e.g., a short introduction to patents themselves, giving the reader insight into the main sources of patents as well as a potential historical overview of the same metric.
- Q1.1 focuses on how the data evolves over time from a geographic standpoint, so maps could be a good way to visualize the migration. Interactive graphs, discussed under Q2 below, could be an alternative here as well.
    - One approach is to show several maps for successive time periods, so the reader can see whether patterns of innovation centers emerge.
    - Another approach is an interactive chart that lets the reader toggle between time periods, enabling comparison on a single map.
- Q1.2 can potentially provide a good segue into what we brought up in the project's abstract: _"While innovation is often portrayed as the product of either one charismatic leader - or a ragtag team of geniuses - in reality we suspect that innovations, however important, happen in small steps supported by large networks of people."_
    - This will also be addressed in more detail in research question 2.
    - Bar plots or line plots could help answer the first part of the question; the same tools could also be a first way to approach the second part, after segmenting the data by geography. Maps could also be explored here.
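
For Q1.2, the bar or line plots would need counts per period and geography. A minimal sketch of that preparation step follows, assuming a table with one row per (patent, assignee) pair. The column names `year`, `country` and `assignee_id` and the toy rows are hypothetical, not the project's actual schema.

```python
import pandas as pd

# Toy rows: one row per (patent, assignee) pair.
rows = pd.DataFrame({
    'year': [2015, 2015, 2015, 2016, 2016, 2016],
    'country': ['US', 'US', 'KR', 'US', 'KR', 'KR'],
    'assignee_id': ['a1', 'a1', 'a2', 'a3', 'a2', 'a4'],
})

# Distinct assignees per year and country, with countries as columns.
counts = (
    rows.groupby(['year', 'country'])['assignee_id']
    .nunique()
    .unstack(fill_value=0)
)
print(counts)
```

With matplotlib installed, `counts.plot(kind='bar')` or `counts.plot()` would turn this table into a grouped bar chart or one line per country.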

### Q2/ How can we best identify and visualize different geographical innovation networks? Can we estimate the number of people in such networks?
_Q2.1/ If we then take a few examples of different types of companies and look at the network of patents supporting their own patents, will these networks match up with the former innovation networks, or will they be more self-contained? In the latter case, can we estimate the number of people that make up these networks? Are these innovation networks concentrated around specific areas, or are they spread out?  
Q2.2/ Do similar companies use the same knowledge bases to innovate? For example, if we look at different social networking companies, will the networks supporting their patents be distinct? Will a given company's patents mostly cite its own previous patents, or will they tap outside innovation networks? On what scale?  
Q2.3/ What about if we look at university/academic knowledge bases and compare them with those of the companies analyzed above?   
Q2.4/ What about governmental or non-governmental organizations, or international agencies?    
Q2.5/ How have the innovation networks identified above evolved through time?_ 

### Approach Q2
- For the geographically focused research question, a natural approach is to explore the data on maps, e.g. to visualize the innovation networks.
- In HW2 we worked with **Folium**, which could be an interesting tool here, as we have access to clean longitude and latitude data (see preprocessing.ipynb). Folium offers, for example, cluster functionality, which could be a way to quantify the magnitude of the networks in different geographic zones.
    - Folium also gives us the option to add interactivity.
- To represent networks, an alternative is a graph-based approach, where entities within a network are represented as nodes and the relationships between them as edges. The Python library **networkx**, which enables the creation of network graphs, is one option.
    - Network representations could also be relevant for question 2.2, when exploring the patent citations of similar companies.
- Throughout the data story, **interactive graphs** could be particularly effective in making the plots self-sufficient and in inviting exploration by the reader. Adding the element of time makes it possible to compare how innovation networks develop, and by hovering over parts of the map the reader can obtain more detailed information.
    - The Python library **HoloViews** can, alongside networkx, enable the creation of interactive network graphs.
    - Another interactive visualization library to explore is **Bokeh**.
- Questions Q2.1 and Q2.2 could provide tangible examples of how prevalent (or not) patent networks are for well-known companies.
- Following these results with those of questions Q2.3 and Q2.4 could give a good overview of the similarities and differences between the three major categories of organizations (academia, non-governmental and governmental organizations). This should be of interest to readers in various fields, and (hopefully) also useful to our peers (students) who might be interested in research and innovation and are contemplating possible career tracks.
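
The graph-based approach could be sketched as follows with networkx: patents are nodes, a citation is a directed edge from the citing to the cited patent, and each node carries its assignee so that self-citations can be told apart from outside ones (Q2.2). The patent IDs, the company labels and the `assignee` attribute name are toy assumptions, not project data.

```python
import networkx as nx

# Toy citation pairs: (citing patent, cited patent).
citations = [('S1', 'S0'), ('S2', 'S0'), ('S2', 'X1'), ('X2', 'X1'), ('S1', 'X1')]
owner = {'S0': 'Samsung', 'S1': 'Samsung', 'S2': 'Samsung', 'X1': 'Other', 'X2': 'Other'}

graph = nx.DiGraph()
graph.add_edges_from(citations)
nx.set_node_attributes(graph, owner, name='assignee')

# The in-degree of a node is the number of patents citing it.
most_cited = max(graph.nodes, key=graph.in_degree)

# Share of Samsung's outgoing citations that point to Samsung's own patents.
samsung_edges = [(u, v) for u, v in graph.edges if graph.nodes[u]['assignee'] == 'Samsung']
self_share = sum(graph.nodes[v]['assignee'] == 'Samsung' for _, v in samsung_edges) / len(samsung_edges)

print(most_cited, self_share)
```

A graph built this way could later be handed to an interactive plotting library for display.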

In summary, the visualization techniques we decide upon, as well as the amount of interactivity we add, if any, will have a big impact on the quality of the data story. The data story is intended to be published eventually on a website generated by Jekyll, a static site generator.