# Data modelling

**We now have a long pipeline with many different classes. This adds complexity and maintenance effort without any clear gain!**

https://github.com/tirthajyoti/Machine-Learning-with-Python/blob/master/OOP_in_ML/Class_MyLinearRegression.ipynb


In [5]:
import pandas as pd

## Get data extracted from PATSTAT

We load the data previously extracted from PATSTAT and stored in five CSV files.

In [31]:
output_files_prefix = "wind_tech"
pre = '../data/raw/' + output_files_prefix
suf = '.csv'
        
TABLE_MAIN_PATENT_INFOS = pd.read_csv(pre + '_table_main_patent_infos' + suf)
TABLE_CPC = pd.read_csv(pre + '_table_cpc' + suf)
TABLE_PATENTEES_INFO = pd.read_csv(pre + '_table_patentees_info' + suf)
TABLE_DOCDB_BACKWARD_CITATIONS = pd.read_csv(pre + '_table_backward_docdb_citations' + suf)
TABLE_DOCDB_FORWARD_CITATIONS = pd.read_csv(pre + '_table_forward_docdb_citations' + suf)

For convenience, we store all the retrieved tables in a single `data` dictionary, keyed by table name.

In [32]:
data = {'_table_main_patent_infos': TABLE_MAIN_PATENT_INFOS,
       '_table_cpc': TABLE_CPC, 
       '_table_patentees_info': TABLE_PATENTEES_INFO,
       '_table_backward_docdb_citations': TABLE_DOCDB_BACKWARD_CITATIONS,
       '_table_forward_docdb_citations': TABLE_DOCDB_FORWARD_CITATIONS}
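
The loading and packing steps above could also be collapsed into a single helper. The following is a minimal sketch, assuming the same file naming scheme as in this notebook; `load_tables` and `TABLE_SUFFIXES` are hypothetical names that do not appear in the original code.

```python
from pathlib import Path

import pandas as pd

# Table suffixes used in this notebook's file names
TABLE_SUFFIXES = [
    '_table_main_patent_infos',
    '_table_cpc',
    '_table_patentees_info',
    '_table_backward_docdb_citations',
    '_table_forward_docdb_citations',
]


def load_tables(directory, prefix, suffixes=TABLE_SUFFIXES, extension='.csv'):
    """Read one CSV file per suffix and return the tables keyed by suffix
    (hypothetical helper, not part of the original notebook)."""
    directory = Path(directory)
    return {
        suffix: pd.read_csv(directory / f'{prefix}{suffix}{extension}')
        for suffix in suffixes
    }
```

With the files used here, the call would be `data = load_tables('../data/raw', 'wind_tech')`, which yields the same keys as the dictionary built by hand.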

## Classes and methods

### General configuration of the model
For clarity, we store all the constant parameters of the model in a Config class.

In [42]:
class Config:
    """Contains the configuration of the data_model"""
    
    # Magic numbers
    LAST_YEAR_TO_RECEIVE_CITAITONS = 2018

    # PATSTAT variables
    VAR_APPLN_ID = 'appln_id'
    VAR_DOCDC_FAMILY_ID = 'docdb_family_id'
    VAR_CITED_DOCDB_FAM_ID = 'cited_docdb_family_id'
    VAR_APPLN_FILLING_YEAR = 'appln_filing_year'
    VAR_NB_CITING_DOCDB_FAM = 'nb_citing_docdb_fam'
    VAR_EARLIEST_FILLING_DATE = 'earliest_filing_date'
    VAR_EARLIEST_FILING_YEAR = 'earliest_filing_year'

    # Computed variables
    NEW_VAR_CITING_DOCDB_FAM_IDS = 'citing_docdb_families_ids'
    NEW_VAR_NB_CITING_DOCDB_FAM_BY_YEAR = 'nb_citing_docdb_fam_by_year'

### Data pre-processing
This class contains all the pre-processing methods needed to filter and reshape the data before starting the analysis. The methods are then called from the model itself.

In [43]:
class DataPreProcessing:
    """Methods to clean the data retrieved from PATSTAT and the EP full-text database and
    to compute some variables"""
    
    def __init__(self, config):
        self.config = config
    
    
    def _compute_fam_citations_by_year(self, data):
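        """
        Add the average number of patent family citations received per year:
        the citation count divided by the number of years between the filing
        year and LAST_YEAR_TO_RECEIVE_CITAITONS.
        """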
        
        # Unpacking some variables for clarity
        df = data['_table_main_patent_infos']
        citations_by_year = self.config.NEW_VAR_NB_CITING_DOCDB_FAM_BY_YEAR
        citations_docdb_fam = self.config.VAR_NB_CITING_DOCDB_FAM
        year = self.config.VAR_APPLN_FILLING_YEAR
        ref_year = self.config.LAST_YEAR_TO_RECEIVE_CITAITONS
        
        print('-> Adding the number of patent family citations received by year')
        df[citations_by_year] = df[citations_docdb_fam]/(ref_year-df[year]) 
        
        data['_table_main_patent_infos'] = df
        return data
    
    
    def _normalize(self):
        pass

### Reshaping the data in Object Oriented fashion
The input data is tabular, but the network structure we aim for is very different. As an intermediate step, we store all the data in Patent objects, since patents are our main unit of analysis.

#### Patent class
We therefore create a Patent class. Since a patent has a long list of attributes, we store them in a dictionary. As a shortcut, we also store the main patent key `appln_id` as an attribute directly accessible with `patent.appln_id`.

In [44]:
class Patent:
    """A patent application, identified by its PATSTAT `appln_id`"""

    def __init__(self, appln_id):
        """Setting the patent parameters"""
        # Per-instance dictionary containing the patent's attributes
        # (a dict defined at class level would be shared by all patents)
        self.patent_attributes = {Config.VAR_APPLN_ID: appln_id}
        # As a shortcut, we store the main patent key as a direct attribute
        self.appln_id = appln_id
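
Whether the attribute dictionary is created in the class body or inside `__init__` matters in Python: a dictionary defined in the class body is a single object shared by every instance, so each new patent would overwrite the values of the previous ones. The two toy classes below (hypothetical names, for illustration only) show the difference.

```python
class SharedAttributes:
    attributes = {}  # class-level dict: one object shared by all instances

    def __init__(self, appln_id):
        self.attributes.update({'appln_id': appln_id})


class OwnAttributes:
    def __init__(self, appln_id):
        self.attributes = {'appln_id': appln_id}  # instance-level dict


a, b = SharedAttributes(1), SharedAttributes(2)
shared_same_dict = a.attributes is b.attributes  # same object for a and b

c, d = OwnAttributes(1), OwnAttributes(2)
own_same_dict = c.attributes is d.attributes  # distinct objects for c and d
```

After this runs, `a.attributes['appln_id']` is `2` because `b` overwrote it, whereas `c.attributes['appln_id']` is still `1`.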

#### Reshaping to OOP methods
We group all the methods that reshape the data into an OOP form in a dedicated class.

In [45]:
class ReshapingToOOP:
    """Methods to assign the data to patent objects"""
    
    def __init__(self):
        pass
    
    
    def _create_patent_objects(self):
        """"""
        pass
    
    
    def _assign_data_to_patent_obj(self):
        """
        Once the data has been retrieved from PATSTAT and the patent objects have been created,
        we assign the data to the Patent objects
        """
        
        
    def snippet_store_patent_attributes(self, table):
        """
        Code snippet to dynamically store attributes
        from a Pandas table in a dictionary.
        If an attribute has several values, they are stored in a list.
        """
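
The behaviour described in the `snippet_store_patent_attributes` docstring is not implemented in the notebook. A minimal sketch of one way it could work is given below, assuming the table has one key column (`appln_id` here) and possibly several rows per key. The function name and the toy table are hypothetical.

```python
import pandas as pd


def store_patent_attributes(table, key='appln_id'):
    """Map each key value to a dictionary of its attributes
    (hypothetical helper, not part of the original notebook)."""
    attributes_by_patent = {}
    for key_value, rows in table.groupby(key):
        attributes = {}
        for column in rows.columns.drop(key):
            values = rows[column].dropna().unique().tolist()
            # One distinct value is stored as a scalar; several values
            # (or none, if all are missing) are stored as a list
            attributes[column] = values[0] if len(values) == 1 else values
        attributes_by_patent[key_value] = attributes
    return attributes_by_patent


# Toy table, made up for illustration only
toy_table = pd.DataFrame({
    'appln_id': [1, 1, 2],
    'appln_auth': ['DE', 'DE', 'AU'],
    'nace2_code': [30.0, 28.1, 28.1],
})
toy_attributes = store_patent_attributes(toy_table)
```

Here patent `1` ends up with a single `appln_auth` value and a list of two `nace2_code` values, while patent `2` has scalars for both.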

In [46]:
class NetworkBuilding:
    """Methods to compute direct and indirect (BC, CC, LC) citations between the patents"""

    # Note: instance methods need `self` as their first parameter
    def __get_direct_citations(self):
        """"""
        pass

    def __get_BC_citations(self):
        """"""
        pass

    def __get_CC_citations(self):
        """"""
        pass

    def __get_LC_citations(self):
        """"""
        pass
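
The notebook does not spell out the abbreviations. Assuming BC and CC stand for bibliographic coupling and co-citation, the two indirect measures could be derived from a list of `(citing, cited)` pairs roughly as sketched below. The function names and the toy pairs are hypothetical, and LC is left out because its definition is not given here.

```python
from collections import defaultdict
from itertools import combinations


def bibliographic_coupling(pairs):
    """Count, for each pair of citing documents, the references they share."""
    citers_of = defaultdict(set)  # cited document -> documents citing it
    for citing, cited in pairs:
        citers_of[cited].add(citing)
    strength = defaultdict(int)
    for citers in citers_of.values():
        for a, b in combinations(sorted(citers), 2):
            strength[(a, b)] += 1
    return dict(strength)


def co_citation(pairs):
    """Count, for each pair of cited documents, the documents citing both."""
    references_of = defaultdict(set)  # citing document -> documents it cites
    for citing, cited in pairs:
        references_of[citing].add(cited)
    strength = defaultdict(int)
    for references in references_of.values():
        for a, b in combinations(sorted(references), 2):
            strength[(a, b)] += 1
    return dict(strength)


# Toy citation pairs (citing, cited), made up for illustration only
toy_pairs = [('A', 'X'), ('A', 'Y'), ('B', 'X'), ('B', 'Y'), ('C', 'Y')]
bc = bibliographic_coupling(toy_pairs)
cc = co_citation(toy_pairs)
```

In this toy example `A` and `B` share two references, so their coupling strength is 2, and `X` and `Y` are cited together by two documents, so their co-citation count is 2.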

In [47]:
class TextProcessing:
    """Methods for text analysis and similarity measures"""

## Model

In [73]:
class Model:
    """Model"""
    
    config: Config
    data_raw: dict  # tables keyed by name, as passed to _input_data
    data: dict
    
    
    def __init__(self, config):
        self.config = config
    
    
    def _input_data(self, data):
        """Assign the input data to the model"""
        self.data_raw = data
        
        
    def _data_pre_processing(self):
        """Pre-processing of the data"""
        data = self.data_raw 
        preprocessor = DataPreProcessing(self.config)
        
        # Using the methods from the DataPreProcessing class
        self.data = preprocessor._compute_fam_citations_by_year(data)
        #data = preprocessor._normalize(data)
        
        

    def _reshape_OOP(self, data):
        """Reshaping the data from the tabular form to an OOP form"""
        pass
    
    
    def _build_network(self, data):
        pass
    
    
    def _text_processing(self, data):
        pass
    
    
    def _compute_LSA(self):
        pass
    
    
    def _compute_similarity_measure(self):
        pass
    
    
    def _create_network(self):
        pass
    
    
    def _create_static_network_over_time(self):
        pass
    
    
    def _detect_communities_static_network(self):
        pass
    
    
    def _trace_communities_dynamic_network(self):
        pass

In [74]:
config = Config()

In [75]:
model = Model(config)
model._input_data(data)

In [76]:
model._data_pre_processing()

-> Adding the number of patent family citations received by year


In [78]:
model.data['_table_main_patent_infos']

| Unnamed: 0 | appln_id | appln_id.1 | appln_auth | appln_nr | appln_kind | appln_filing_date | appln_filing_year | appln_nr_epodoc | appln_nr_original | ipr_type | ... | ipc_class_symbol | ipc_class_level | ipc_version | ipc_value | ipc_position | ipc_gener_auth | appln_id.5 | nace2_code | weight | nb_citing_docdb_fam_by_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44954682 | 44954682 | TR | 32490 | A | 1990-04-20 | 1990 | TR19900000324 | 00324/90 | PI | ... | F03D 3/00 | A | 2006-01-01 | I | | EP | 44954682.0 | 28.1 | 1.0 | 0.000000 |
| 1 | 7233424 | 7233424 | CN | 91221168 | U | 1991-11-18 | 1991 | CN1991221168U | 91221168 | UM | ... | F03D 3/00 | A | 2006-01-01 | I | | EP | 7233424.0 | 28.1 | 1.0 | 0.000000 |
| 2 | 11177052 | 11177052 | DE | 4040411 | A | 1990-12-18 | 1990 | DE19904040411 | 4040411 | PI | ... | B64C 11/00 | A | 2006-01-01 | I | | EP | 11177052.0 | 30.0 | 0.5 | 0.071429 |
| 3 | 11177052 | 11177052 | DE | 4040411 | A | 1990-12-18 | 1990 | DE19904040411 | 4040411 | PI | ... | B64C 11/00 | A | 2006-01-01 | I | | EP | 11177052.0 | 28.1 | 0.5 | 0.071429 |
| 4 | 11177052 | 11177052 | DE | 4040411 | A | 1990-12-18 | 1990 | DE19904040411 | 4040411 | PI | ... | F03D 1/06 | A | 2006-01-01 | I | | EP | 11177052.0 | 30.0 | 0.5 | 0.071429 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2465 | 2176567 | 2176567 | AU | 6535590 | A | 1990-08-30 | 1990 | AU19900065355 | 6535590 | PI | ... | B64C 5/08 | A | 2006-01-01 | I | | EP | 2176567.0 | 30.0 | 0.8 | 0.785714 |
| 2466 | 2176567 | 2176567 | AU | 6535590 | A | 1990-08-30 | 1990 | AU19900065355 | 6535590 | PI | ... | B64C 5/08 | A | 2006-01-01 | I | | EP | 2176567.0 | 28.1 | 0.2 | 0.785714 |
| 2467 | 2176567 | 2176567 | AU | 6535590 | A | 1990-08-30 | 1990 | AU19900065355 | 6535590 | PI | ... | F03D 11/00 | A | 2006-01-01 | I | | EP | 2176567.0 | 30.0 | 0.8 | 0.785714 |
| 2468 | 2176567 | 2176567 | AU | 6535590 | A | 1990-08-30 | 1990 | AU19900065355 | 6535590 | PI | ... | F03D 11/00 | A | 2006-01-01 | I | | EP | 2176567.0 | 28.1 | 0.2 | 0.785714 |
