## Motivation

- This part presents an initial outline of the tools that can help us create the Data Story, based on the work done so far.
- By keeping this outline in mind and updating it iteratively, we expect to reflect throughout the data analysis pipeline on how to present our results effectively, while staying open to new insights as the project progresses.
- At the same time, being conscious of the data story can make our analysis more structured: we develop the story we want to tell and look for the data that can support it.

## Structure
- The initial outline focuses on how we could communicate the results for each research question, and on a first proposal for the order in which to present them.
- Later iterations will increasingly focus on building a coherent story that follows a well-thought-out order to improve readability.

### Q1/ What does the typical patent-holder look like today (Corporation, Universities, Governments, Individuals), and how has that evolved throughout time / geographies?
_Q1.1/ Is a migration of innovators through time visible in the data, e.g. a convergence towards certain innovation centers?  
Q1.2 / How has the number of assignees and inventors evolved through time for different patent types? Are there significant differences in these numbers between different geographies?_  

Get all the patent data for the years 2000 to 2003.

In [16]:
# data processing
import pandas as pd
import numpy as np
from pipeline import get_full_year_data, load_preprocessed_data, get_layers_data, load_layers

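# visualizations
import folium
from folium import plugins
import matplotlib as mpl
import matplotlib.pyplot    # submodule import, so that mpl.pyplot is available
import matplotlib.gridspec  # submodule import, so that mpl.gridspec is available
import ipywidgets
from IPython.display import display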

**Define path to save all data files**

In [2]:
MY_PATH = '/media/dcm/HDD/ADA_DATA'

**Getting data for Q1**

In [3]:
full_datafile_lists = {}
full_year_data = {}
year_range = range(2000,2003 + 1)

for year in year_range:
    full_datafile_lists[str(year)] = get_full_year_data(str(year), MY_PATH)

/media/dcm/HDD/ADA_DATA 2000q1
already on file
/media/dcm/HDD/ADA_DATA 2000q2
already on file
/media/dcm/HDD/ADA_DATA 2000q3
already on file
/media/dcm/HDD/ADA_DATA 2000q4
already on file
/media/dcm/HDD/ADA_DATA 2001q1
already on file
/media/dcm/HDD/ADA_DATA 2001q2
already on file
/media/dcm/HDD/ADA_DATA 2001q3
already on file
/media/dcm/HDD/ADA_DATA 2001q4
already on file
/media/dcm/HDD/ADA_DATA 2002q1
already on file
/media/dcm/HDD/ADA_DATA 2002q2
already on file
/media/dcm/HDD/ADA_DATA 2002q3
already on file
/media/dcm/HDD/ADA_DATA 2002q4
already on file
/media/dcm/HDD/ADA_DATA 2003q1
already on file
/media/dcm/HDD/ADA_DATA 2003q2
already on file
/media/dcm/HDD/ADA_DATA 2003q3
already on file
/media/dcm/HDD/ADA_DATA 2003q4
already on file


In [4]:
for year in year_range:
    full_year_data[str(year)] = load_preprocessed_data(full_datafile_lists[str(year)])

**Example : getting data for Q2**

In [None]:
example_patent_file = get_layers_data('apple_example', MY_PATH, ['9430098'], 4)

In [None]:
apple_example_data = load_layers(example_patent_file)

In [None]:
len(apple_example_data['0']['inventors']), len(apple_example_data['1']['inventors']), len(apple_example_data['2']['inventors']), len(apple_example_data['3']['inventors'])

### Approach Q1
- Question 1 is a natural next step after, for example, a short introduction to patents themselves. It gives the reader insight into the main sources of patent creation, and potentially a historical overview of the same metric.
- Q1.1 focuses on how the data evolves through time from a geographic standpoint, so maps are a good candidate for visualizing the migration. The interactive graphs discussed under Q2 below are another option here.
    - One approach is to use several maps for successive time periods, so the reader can see whether patterns of innovation centers emerge.
    - Another approach is a single interactive map that lets the reader toggle between time periods and compare them.
- Q1.2 can potentially provide a good segue into what we brought up in the project's abstract: _"While innovation is often portrayed as the product of either one charismatic leader - or a ragtag team of geniuses - in reality we suspect that innovations, however important, happen in small steps supported by large networks of people."_
    - This will also be addressed in more detail in research question 2.
    - Various bar plots or line plots could help answer the first part of the question. The same tools could also be a first approach to the second part, once the data is segmented by geography. Maps could also be explored here.
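
The geographic segmentation mentioned for Q1.2 could be sketched with pandas as follows. This is only a minimal illustration, not the project's pipeline: the `inventors` table and its `year`, `country`, and `inventor_id` columns are hypothetical stand-ins for whatever the preprocessed data actually exposes.

```python
import pandas as pd

# Hypothetical toy data: one row per (patent, inventor) pair.
# Column names are assumptions made for this sketch only.
inventors = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001, 2001],
    "country": ["US", "US", "JP", "US", "JP", "JP", "DE"],
    "inventor_id": ["a", "b", "c", "a", "c", "d", "e"],
})


def inventors_per_year_and_country(df):
    """Count distinct inventors per year (rows) and country (columns)."""
    counts = df.groupby(["year", "country"])["inventor_id"].nunique()
    # Countries with no inventors in a given year get a count of 0.
    return counts.unstack(fill_value=0)


inventor_counts = inventors_per_year_and_country(inventors)
print(inventor_counts)
```

With the data in this shape, `inventor_counts.plot()` or `inventor_counts.plot.bar()` would give one line or one group of bars per geography, which is the kind of comparison the second part of Q1.2 asks for.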

In [5]:
min_year = list(full_year_data.keys())[0]
max_year = list(full_year_data.keys())[-1]

In [68]:
def get_chart_color(patent_type):
    
    if patent_type == 'design' :
        return (47/255, 147/255, 147/255)
    
    if patent_type == 'plant' :
        return (151/255, 179/255, 100/255)
    
    if patent_type == 'utility' :
        return (36/255, 93/255, 147/255)

In [69]:
def get_ts():
    num_patents_ts = []
    num_inventors_ts = []
    num_citations_ts = []
    num_utility_patents_ts = []
    num_design_patents_ts = []
    prop_us_private_ts = []
    prop_nonus_private_ts = []
    
    for year in full_year_data:
        num_patents_ts.append(full_year_data[year]['num_patents'])
        num_inventors_ts.append(full_year_data[year]['num_inventors'])
        num_citations_ts.append(full_year_data[year]['num_citations'])
        num_utility_patents_ts.append(full_year_data[year]['proportion_by_patent_type'].loc['utility'] * num_patents_ts[-1])
        num_design_patents_ts.append(full_year_data[year]['proportion_by_patent_type'].loc['design'] * num_patents_ts[-1])
        prop_us_private_ts.append(full_year_data[year]['proportion_by_assignee_type'].loc['2'])
        prop_nonus_private_ts.append(full_year_data[year]['proportion_by_assignee_type'].loc['3'])
        
    num_patents_ts = np.array(num_patents_ts).flatten() / 1000
    num_inventors_ts = np.array(num_inventors_ts).flatten() / 1000
    num_citations_ts = np.array(num_citations_ts).flatten() / 1000
    num_utility_patents_ts = np.array(num_utility_patents_ts).flatten() / 1000
    num_design_patents_ts = np.array(num_design_patents_ts).flatten() / 1000
    prop_us_private_ts = np.array(prop_us_private_ts).flatten()
    prop_nonus_private_ts = np.array(prop_nonus_private_ts).flatten()
    
    # the series are already NumPy arrays at this point
    return (num_patents_ts, num_inventors_ts, num_citations_ts,
            num_utility_patents_ts, num_design_patents_ts,
            prop_us_private_ts, prop_nonus_private_ts)

In [70]:
patents_ts, inventors_ts, citations_ts, utility_ts, design_ts, us_ts, nonus_ts = get_ts()

In [148]:
def fig_param(years, viz) :
    
    year1 = years[0]
    year2 = years[1]
    
    year_range = np.array(range(int(min_year), int(max_year) + 1))
    selected_range = np.array(range(int(year1), int(year2) + 1))
    patent_type_dist_y1 = full_year_data[str(year1)]['proportion_by_patent_type'].sort_values(by=0, ascending = False).iloc[:3,:]
    patent_type_dist_y2 = full_year_data[str(year2)]['proportion_by_patent_type'].sort_values(by=0, ascending = False).iloc[:3,:]

    if viz == 'Overview' :
        
        fig = mpl.pyplot.figure()
        gs = mpl.gridspec.GridSpec(4, 2, wspace=0.5, hspace=0, height_ratios=[2, 1, 0.75, 0.75]) # 4x2 grid
        ax0 = fig.add_subplot(gs[0, 0]) # first row, first col
        ax1 = fig.add_subplot(gs[0, 1]) # first row, second col
        ax2 = fig.add_subplot(gs[1, :]) # full second row
        ax3 = fig.add_subplot(gs[2, :]) # full third row
        ax4 = fig.add_subplot(gs[3, :]) # full fourth row
        
        """
        PLOT PIE CHARTS FOR SELECTED YEARS
        """
        patches1, texts1, autotexts1 = ax0.pie(patent_type_dist_y1.values, labels = patent_type_dist_y1.index.values, 
                                               colors = [get_chart_color(p) for p in patent_type_dist_y1.index.values], 
                                               autopct='%1.1f%%', shadow=True)
        ax0.set_title(year1, size = 30, pad=-10)

        patches2, texts2, autotexts2 = ax1.pie(patent_type_dist_y2.values, labels = patent_type_dist_y2.index.values, 
                                               colors = [get_chart_color(p) for p in patent_type_dist_y2.index.values], 
                                               autopct='%1.1f%%', shadow=True)
        ax1.set_title(year2, size = 30, pad=-10)

        # FONT SIZE
        for text in np.hstack([texts1, texts2, autotexts1, autotexts2]):
            text.set_fontsize(20)

        """
        PLOT LINE CHARTS
        """
        ax2.set_title("All Numbers in [000's]", size = 20)
        # SELECTED RANGE INDICES
        start = sum(selected_range[0] > year_range)
        end = len(selected_range) + start

        # BASE PATENTS
        ax2.stackplot(year_range, utility_ts, design_ts, patents_ts - utility_ts - design_ts,
                      colors = ['#CCE9FF','#DEFFEE','#FFEDF3'])
        # HIGHLIGHTED SELECTION (YEAR RANGE)
        ax2.stackplot(selected_range, utility_ts[start:end], design_ts[start:end], patents_ts[start:end] - utility_ts[start:end] - design_ts[start:end],
                      labels = ['Number of Utility Patents', 'Design Patents', 'Other Patents'],
                      colors = ['#377EB8','#55BA87','#7E1137'])

        ax2.set_ylim(bottom=190)

        # INVENTORS AND CITATIONS
        ax3.plot(year_range, inventors_ts, linewidth = 2, color = '#CCE9FF')
        ax4.plot(year_range, citations_ts, linewidth = 2, color = '#DEFFEE')
        # HIGHLIGHTED SELECTION (YEAR RANGE)
        ax3.plot(selected_range, inventors_ts[start:end], label = 'Number of Inventors', linewidth = 3, color = '#377EB8')
        ax4.plot(selected_range, citations_ts[start:end], label = 'Number of Citations',linewidth = 3, color = '#55BA87')

        # FORMAT PLOTS
        #fig.patch.set_visible(False)
        ax3.yaxis.tick_right()
        ax4.set_xticks(year_range)

        # despine
        for i, a in enumerate([ax2, ax3, ax4]):
            a.grid(axis = 'y')
            a.tick_params(axis='y', which='both',length=0)
            leg = a.legend(prop = {'size' : 14}, frameon = False, loc = 2)
            leg.get_frame().set_linewidth(0.0)
            for spine in ["top", "right", "bottom"]:
                a.spines[spine].set_visible(False)
                
        fig.set_size_inches(14,12)
        mpl.pyplot.show()
    
    if viz == 'Utility Patents':

        fig = mpl.pyplot.figure()
        gs = mpl.gridspec.GridSpec(1, 2, wspace=0.5, hspace=0, height_ratios=[0.5]) # 1x2 grid
        ax0 = fig.add_subplot(gs[0, 0]) # first row, first col
        ax1 = fig.add_subplot(gs[0, 1]) # first row, second col
        
        # top_assignees = full_year_data['2000']['inventors'].groupby(by='assignee').size().sort_values(ascending=False).keys().values[:10]

        # draw map
        map_ = folium.Map((30,15), zoom_start = 1.5)
        hm = plugins.HeatMap(full_year_data[str(year1)]['inventors'][['latitude','longitude']].values[:1000,:],
                     radius = 10)
        hm.add_to(map_)

        fig.set_size_inches(14,4)
        mpl.pyplot.show()

        display(map_)

In [152]:
style = {'description_width': 'initial'}

year_widget = ipywidgets.IntRangeSlider(
    value=[int(min_year), int(max_year)],
    min=int(min_year),
    max=int(max_year),
    step=1,
    description='Select Time:',
    disabled=False,
    orientation='horizontal',
    style = style,
    layout= ipywidgets.Layout(width='60%', height='20px', margin='10px')
)

selection_widget = ipywidgets.ToggleButtons(
    options=['Overview','Utility Patents','Design Patents','Plant Patents'],
    description='Select Visualization:',
    button_style='',
    layout = ipywidgets.Layout(margin='10px'),
    style = style
)

In [153]:
q1_widget = ipywidgets.interactive(fig_param, years = year_widget, viz = selection_widget)

In [154]:
q1_widget

interactive(children=(IntRangeSlider(value=(2000, 2003), description='Select Time:', layout=Layout(height='20p…

### Q2/ How can we best identify and visualize different geographical innovation networks? Can we estimate the number of people in such networks?
_Q2.1/ If we then take a few examples of different types of companies and look at the network of patents supporting their own patents, will these networks match up with the former innovation networks, or will they be more self-contained? In the latter case, can we estimate the number of people that make up these networks? Are these innovation networks concentrated around specific areas, or are they spread out?  
Q2.2/ Do similar companies use the same knowledge bases to innovate? For example, if we look at different social networking companies, will the networks supporting their patents be distinct? Will a given company's patents mostly cite its own previous patents, or will they tap outside innovation networks? On what scale?  
Q2.3/ What about if we look at university/academic knowledge bases and compare them with those of the companies analyzed above?   
Q2.4/ What about governmental or non-governmental organizations, or international agencies?    
Q2.5/ How have the innovation networks identified above evolved through time?_ 

### Approach Q2
- For the geographically focused research question, a natural approach can be to explore the data using maps to visualize e.g. the innovation networks.  
- In HW2 we worked with **Folium**, which can be an interesting tool here, as we have access to clean longitude and latitude data (see preprocessing.ipynb). Folium has, for example, clustering functionality, which could be a way to quantify the magnitude of the networks in different geographic zones.
    - Folium also gives us the option to add interactivity.
- To represent networks, an alternative could be to use a graph-based approach, where relationships between entities within a network could be represented by nodes and edges. The Python library **networkx**, which enables the creation of Network Graphs, is one option. 
    - Network representations could also be relevant to answer question 2.2, when exploring the citation of patents from similar companies. 
- Throughout the data story, **interactive graphs** could be particularly effective at making the plots self-sufficient and at inviting exploration by the reader. Adding the element of time makes it possible to compare how innovation networks develop, and by hovering over parts of the map the reader can obtain more detailed information.
    - The Python library **HoloViews** can, alongside networkx, enable the creation of interactive network graphs. 
    - Another interactive visualization library to explore is **Bokeh**.
- Questions Q2.1 and Q2.2 could provide tangible examples of how prevalent (or not) patent networks are for well-known companies.
- Following these results with those of questions Q2.3 and Q2.4 could, we believe, give a good overview of the similarities and differences between the three major categories of organizations (academia, non-governmental and governmental organizations). This should be of interest to readers in various fields, and will hopefully also provide input to our peers (students) who are interested in research and innovation and are contemplating possible career tracks.
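
The graph-based representation discussed above could look roughly like the sketch below, which uses networkx to model patents as nodes and citations as directed edges. It is a toy illustration under stated assumptions: the patent ids, the assignees, and the helper names `build_citation_graph`, `self_citation_share`, and `supporting_patents` are all hypothetical and are not part of the project's `pipeline` module.

```python
import networkx as nx

# Hypothetical toy data: patent id -> assignee, and (citing, cited) pairs.
assignee_of = {"P1": "A", "P2": "A", "P3": "B", "P4": "B", "P5": "C"}
citations = [("P1", "P2"), ("P1", "P3"), ("P2", "P4"), ("P3", "P5")]


def build_citation_graph(assignee_of, citations):
    """Directed graph with one node per patent and one edge per citation."""
    graph = nx.DiGraph()
    for patent, assignee in assignee_of.items():
        graph.add_node(patent, assignee=assignee)
    graph.add_edges_from(citations)
    return graph


def self_citation_share(graph, assignee):
    """Fraction of the assignee's outgoing citations that cite its own patents."""
    outgoing = [(u, v) for u, v in graph.edges()
                if graph.nodes[u]["assignee"] == assignee]
    if not outgoing:
        return 0.0
    own = sum(graph.nodes[v]["assignee"] == assignee for _, v in outgoing)
    return own / len(outgoing)


def supporting_patents(graph, patent, depth):
    """Patents reachable from `patent` within `depth` citation steps."""
    reached = nx.single_source_shortest_path_length(graph, patent, cutoff=depth)
    return set(reached) - {patent}


graph = build_citation_graph(assignee_of, citations)
print(self_citation_share(graph, "A"))
print(supporting_patents(graph, "P1", 2))
```

A self-citation share of this kind is one possible way to approach Q2.2, and the depth-limited reachability gives a rough handle on the size of the network supporting a single patent. Whether these are the right measures for the data story remains to be decided.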

In summary, the visualization techniques we decide upon, and the amount of interactivity we add, if any, will have a big impact on the quality of the data story. The story is eventually intended to be published on a website generated by Jekyll, a static site generator.