# In this notebook, we train word2vec word embeddings on the job postings data file we found on Kaggle

Here are some goals that I want to accomplish this week:
 - Clean our granted patent dataset so that it contains only utility patents and only the full patent text, patent date, and patent ID columns
 - Clean our patent CPC data so that it can be used in tandem with our granted patent dataset
 - Clean our Kaggle dataset so that it only has the columns relevant to word embeddings and timeline knowledge

## Imports and Reading in Files

In [None]:
# Import pandas, gensim, NumPy, and the plotting libraries; the g_patent and
# g_cpc_current DataFrames are loaded in the cells that follow
import pandas as pd
import gensim
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')

# Widen the maximum displayed column width so long text columns are not cut off.
# It is set here, near the top, because it may need to be changed often.
pd.options.display.max_colwidth = 1000

In [2]:
# Load the granted patent dataset, which we will use alongside the job postings
df_patent = pd.read_csv(
    "g_patent.tsv",
    delimiter="\t",
    dtype={"patent_id": str,
           "patent_type": str,
           "patent_title": str,
           "patent_abstract": str,
           "wipo_kind": str,
           "num_claims": int,
           "withdrawn": int,
           "filename": str},
    parse_dates=[2],  # column 2 is patent_date
)
# Keep only utility patents
df_patent.drop(df_patent[df_patent['patent_type'] != 'utility'].index, inplace=True)
# Combine title and abstract into one text column.
# Note: no separator is added, so the title runs straight into the abstract.
df_patent['patent_details'] = df_patent['patent_title'].astype(str) + df_patent['patent_abstract'].astype(str)
# Drop every column between patent_date and the new patent_details column
df_patent.drop(columns=df_patent.columns[3:-1], inplace=True)
df_patent.head(20)  # preview the cleaned patent table

Unnamed: 0,patent_id,patent_type,patent_date,patent_details
0,10000000,utility,2018-06-19,"Coherent LADAR using intra-pixel quadrature detectionA frequency modulated (coherent) laser detection and ranging system includes a read-out integrated circuit formed with a two-dimensional array of detector elements each including a photosensitive region receiving both return light reflected from a target and light from a local oscillator, and local processing circuitry sampling the output of the photosensitive region four times during each sample period clock cycle to obtain quadrature components. A data bus coupled to one or more outputs of each of the detector elements receives the quadrature components from each of the detector elements for each sample period and serializes the received quadrature components. A processor coupled to the data bus receives the serialized quadrature components and determines an amplitude and a phase for at least one interfering frequency corresponding to interference between the return light and the local oscillator light using the quadrature comp..."
1,10000001,utility,2018-06-19,"Injection molding machine and mold thickness control methodThe injection molding machine includes a fixed platen, a moveable platen moving forward and backward by a toggle link, a base plate supporting the toggle link, a driving part for mold clamping to operate the toggle link, a driving part for mold thickness adjustment to adjust a mold thickness, and a control unit to calculate a movement distance gap before a clamping process by controlling the driving part for mold thickness adjustment to move the base plate backward and then move the base plate forward to a target movement position based on a fold amount of the toggle link, and control the driving part for mold thickness adjustment using a value obtained by deducting the movement distance gap from the fold amount of the toggle link when producing a clamp force."
2,10000002,utility,2018-06-19,"Method for manufacturing polymer film and co-extruded filmThe present invention relates to: a method for manufacturing a polymer film, the method including a base film forming step for co-extruding a first resin containing a polyamide-based resin and a second resin containing a copolymer including polyamide-based segments and polyether-based segments; a co-extruded film including a base film including a first resin layer containing a polyamide-based resin, and a second resin layer containing a copolymer having polyamide-based segments and polyether-based segments; to a co-extruded film including a base film including a first resin layer and a second resin layer, which have different melting points; and to a method for manufacturing a polymer film, the method including a base film forming step including a step of co-extruding a first resin and a second resin, which have different melting points."
3,10000003,utility,2018-06-19,"Method for producing a container from a thermoplasticThe invention relates to a method for producing a container (2) from a thermoplastic, having at least one surround (4), provided in the container wall (1), for a container opening. The surround (4) comprises a structure behind which parts of the container wall (1) extend and/or which is penetrated by said parts. The method is carried out using a multi-part blow mold that has at least two mold parts, each having at least one cavity, wherein the surround is placed as an insert in the cavity (10) of the blow mold (7). The method comprises pressing the preform that has been forced into the cavity (10) into the structure of the surround (4) by means of a tool which is brought to bear on the preform (12) on the side of the preform facing away from the cavity (10)."
4,10000004,utility,2018-06-19,"Process of obtaining a double-oriented film, co-extruded, and of low thickness made by a three bubble process that at the time of being thermoformed provides a uniform thickness in the produced trayThe present invention relates to provides a double-oriented film, co-extrude, and of low thickness, with a layered composition that gives the property of being of high barrier to gases and manufactured by the process of co-extrusion of 3 bubbles, which gives the property of when being thermoformed, ensure the distribution of uniform thickness in the walls, base, folds, and corners of the formed tray saving a minimum of 50% of plastic without diminishing its gas barrier and its resistance to puncture."
5,10000005,utility,2018-06-19,"Article vacuum formation method and vacuum forming apparatusA vacuum forming apparatus is provided that forms an article having a covering bonded to the surface of a substrate in a molding space using a first mold and a second mold. The vacuum forming apparatus is provided with clamps for grasping the covering between the first and second molds arranged at the open positions. The clamps are movable between an interfering position, at which the clamps are located in the movement ranges of the first and second molds, and standby positions, at which the clamps are outside the movement ranges. After the covering is heated, the clamps grasping the covering move to the standby positions and stretch the covering. The first and second molds move to the closed positions and the article is molded between the first and second molds so that the stretched covering and the substrate are bonded to each other."
6,10000006,utility,2018-06-19,"Thermoforming mold device and a process for its manufacture and useA thermoforming mold device (1) providing a piece with a thin wall starting with a sheet of thermoplastic material is provided. At least one (3) of two parts of the mold (3, 3′) comprises at least one means (4) of local deformation of a sheet (2′) in the mold (3, 3′) in its closed state, the at least one means (4) comprises a piece of hollow molding with a peripheral edge, which can be connected selectively to a source of suction and can be displaced between a folded position, in which the molding piece is situated in close proximity with the wall of the thermoformed piece, and a deployed position, in which the molding piece is applied under pressure with its peripheral edge against the wall of the thermoformed piece upholding the other part of the mold."
7,10000007,utility,2018-06-19,"PEX expanding toolAn expanding tool comprising: an actuator comprising a cylindrical housing that defines an actuator housing cavity; a primary ram disposed within the actuator housing cavity, the primary ram defining an internal primary ram cavity; a secondary ram disposed within the internal primary ram cavity; a cam roller carrier coupled to a distal end of the secondary ram; a drive collar positioned within a distal end of the actuator housing cavity; a roller clutch disposed within an internal cavity defined by an inner surface of the drive collar; a shuttle cam positioned between the roller clutch and a distal end of the primary ram; an expander cone coupled to the primary ram; and an expander head operably coupled to the drive collar."
8,10000008,utility,2018-06-19,"Bracelet mold and method of useA decorated strip of coated, heat-shrinkable, plastic sheet material is placed in a spiral slot formed in a silicone rubber mold. The spiral slot is defined by a spiral wall having a uniform wall thickness. Upon heating in an oven, the material shrinks, forming a resiliently expansible arc-shaped band that can be worn as a bracelet or wristband."
9,10000009,utility,2018-06-19,"Sterile environment for additive manufacturingIn sterile, additive manufacturing wherein one lamella is successively built upon an underlying lamella until an object is completed, a sterile manufacturing environment is provided. A major chamber large enough to accommodate the manufactured object has sterile accordion pleated sidewalls and a sterile top closed with flap valves. A minor chamber for supporting the nozzles positioned above the major chamber has similar valves in corresponding positions. Nozzles for material deposition penetrate the pair of valves to block air and particles from entry into the major chamber where the nozzles make layer by layer deposition of the object using XY areawise nozzle motion relative to the object as well as Z nozzle vertical motion with the major chamber expanding as the object is formed."
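
In the preview above, each title runs straight into its abstract (for example, `detectionA frequency`), because the two strings are concatenated without a separator. Below is a minimal sketch of one way to avoid that, assuming a single space is an acceptable separator and that missing values should become empty strings rather than the literal text `nan`. The two-row `toy` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical two-row stand-in for df_patent (not real patent records)
toy = pd.DataFrame({
    "patent_title": ["PEX expanding tool", "Bracelet mold and method of use"],
    "patent_abstract": ["An expanding tool comprising an actuator.", None],
})

# Fill missing text with "", join with a space, then trim stray whitespace
toy["patent_details"] = (
    toy["patent_title"].fillna("") + " " + toy["patent_abstract"].fillna("")
).str.strip()

print(toy["patent_details"].tolist())
```

Keeping a separator matters for word embeddings, since fused words such as `detectionA` would otherwise enter the vocabulary as single tokens.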


In [5]:
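# Load the current CPC classifications. patent_id is read as a string so it
# can be matched against df_patent later.
df_cpc = pd.read_table("g_cpc_current.tsv", delimiter="\t", dtype={"patent_id": str,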
                                                               "cpc_sequence": int,
                                                               "cpc_section": str,
                                                               "cpc_subclass": str,
                                                               "cpc_group": str,
                                                               "cpc_type": str,
                                                               "cpc_symbol_position": str})
df_cpc.head(20)

Unnamed: 0,patent_id,cpc_sequence,cpc_section,cpc_class,cpc_subclass,cpc_group,cpc_type,cpc_symbol_position
0,4796895,1,F,F16,F16H,F16H61/00,inventional,
1,10913199,0,B,B29,B29C,B29C55/08,inventional,
2,5208443,0,B,B29,B29C,B29C65/366,inventional,
3,7830588,6,G,G09,G09G,G09G2310/0275,additional,
4,7232943,1,A,A01,A01H,A01H5/10,inventional,
5,10815370,2,C,C08,C08F,C08F265/08,inventional,
6,8271025,4,H,H04,H04M,H04M15/00,inventional,
7,8208778,1,G,G02,G02B,G02B6/12002,inventional,
8,10299603,18,B,B64,B64D,B64D11/00154,inventional,
9,10941581,24,B,B32,B32B,B32B2255/10,additional,
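
One of the goals is to use the CPC data in tandem with the granted patents. Below is a minimal sketch of that join on `patent_id`, using small made-up frames in place of `df_patent` and `df_cpc`. Keeping only `cpc_sequence == 0` is an assumption that the lowest sequence number marks the first-listed classification; without some filter like this, a patent with several CPC rows would be duplicated by the merge.

```python
import pandas as pd

# Hypothetical stand-ins for df_patent and df_cpc (ids and codes are made up)
patents = pd.DataFrame({
    "patent_id": ["100", "101", "102"],
    "patent_details": ["text a", "text b", "text c"],
})
cpc = pd.DataFrame({
    "patent_id": ["100", "100", "101"],
    "cpc_sequence": [0, 1, 0],
    "cpc_section": ["B", "G", "H"],
})

# Assumption: sequence 0 is the first-listed CPC code for each patent
first_cpc = cpc[cpc["cpc_sequence"] == 0]

# A left join keeps every patent; patents with no CPC row get NaN in cpc_section
merged = patents.merge(
    first_cpc[["patent_id", "cpc_section"]], on="patent_id", how="left"
)
print(merged)
```

Reading `patent_id` as a string in both files is what makes this join line up; mixed integer and string keys would not match.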


In [6]:
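# Load the Kaggle job postings dataset
online_df = pd.read_csv('onlinejobpostings.csv')
# Parse the posting dates; values that cannot be parsed become NaT
online_df['date'] = pd.to_datetime(online_df['date'], errors='coerce')
# Keep only the first four columns (jobpost, date, Title, Company)
online_df.drop(columns=online_df.columns[4:], inplace=True)
online_df.head(3)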

Unnamed: 0,jobpost,date,Title,Company
0,"AMERIA Investment Consulting Company\r\nJOB TITLE: Chief Financial Officer\r\nPOSITION LOCATION: Yerevan, Armenia\r\nJOB DESCRIPTION: AMERIA Investment Consulting Company is seeking a\r\nChief Financial Officer. This position manages the company's fiscal and\r\nadministrative functions, provides highly responsible and technically\r\ncomplex staff assistance to the Executive Director. The work performed\r\nrequires a high level of technical proficiency in financial management\r\nand investment management, as well as management, supervisory, and\r\nadministrative skills.\r\nJOB RESPONSIBILITIES: \r\n- Supervises financial management and administrative staff, including\r\nassigning responsibilities, reviewing employees' work processes and\r\nproducts, counseling employees, giving performance evaluations, and\r\nrecommending disciplinary action;\r\n- Serves as member of management team participating in both strategic\r\nand operational planning for the company;\r\n- Directs and ove...",2004-01-05,Chief Financial Officer,AMERIA Investment Consulting Company
1,"International Research & Exchanges Board (IREX)\r\nTITLE: Full-time Community Connections Intern (paid internship)\r\nDURATION: 3 months\r\nLOCATION: IREX Armenia Main Office; Yerevan, Armenia \r\nDESCRIPTION: IREX currently seeks to fill the position of a paid\r\nIntern for the Community Connections (CC) Program. The position is based\r\nin the Yerevan office however applicants must be willing to travel\r\nthroughout Armenia as necessary. This position reports directly to the\r\nCC Program Manager.\r\nRESPONSIBILITIES: \r\n- Presenting the CC program to interested parties; \r\n- Assisting in planning and scheduling of programmatic meetings and\r\nevents (this includes coordinating logistics for CC staff, visitors and\r\nparticipants);\r\n- Assisting the Program Staff;\r\n- Translation/Interpretation from Armenian to English and vice versa;\r\n- Helping create, maintain and update the CC filing system and\r\ndatabases;\r\n- Completing general administrative tasks for the CC...",2004-01-07,Full-time Community Connections Intern (paid internship),International Research & Exchanges Board (IREX)
2,"Caucasus Environmental NGO Network (CENN)\r\nJOB TITLE: Country Coordinator\r\nPOSITION DURATION: Renewable annual contract\r\nPOSITION LOCATION: Yerevan, Armenia\r\nJOB DESCRIPTION: Public outreach and strengthening of a growing\r\nnetwork of environmental NGOs, businesses, international organizations\r\nand public agencies. Will serve as primary contact between CENN and\r\npublic. This is a full-time position.\r\nJOB RESPONSIBILITIES: \r\n- Working with the Country Director to provide environmental information\r\nto the general public via regular electronic communications and serving\r\nas the primary local contact to Armenian NGOs and businesses and the\r\nArmenian offices of international organizations and agencies;\r\n- Helping to organize and prepare CENN seminars/ workshops;\r\n- Participating in defining the strategy and policy of CENN in Armenia,\r\nthe Caucasus region and abroad.\r\nREQUIRED QUALIFICATIONS: \r\n- Degree in environmentally related field, or 5 years ...",2004-01-07,Country Coordinator,Caucasus Environmental NGO Network (CENN)
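
Because the dates are parsed with `errors='coerce'`, any value that cannot be parsed silently becomes `NaT`. For the timeline work it may be worth checking how many rows that affects. Here is a small sketch on made-up values (the strings below are not taken from the dataset):

```python
import pandas as pd

# Hypothetical raw date strings; the last one cannot be parsed
raw_dates = pd.Series(["2004-01-05", "2004-01-07", "not a date"])

# Unparseable entries are coerced to NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, errors="coerce")

n_missing = int(parsed.isna().sum())
print(n_missing)
```

On the real frame, the equivalent check would be `online_df['date'].isna().sum()`.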


In [None]:
# Lowercase and tokenize each job posting into a list of words
job_text = online_df.jobpost.apply(gensim.utils.simple_preprocess)
job_text

In [10]:
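model = gensim.models.Word2Vec(
    window=7,     # context words considered on each side of the target word
    min_count=2,  # ignore words that appear fewer than 2 times
    workers=4     # number of training threads
)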

In [11]:
model.build_vocab(job_text, progress_per=1000)

In [12]:
model.train(job_text, total_examples=model.corpus_count, epochs=model.epochs)

(25435125, 33680420)
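
`train` returns a pair of counts: the effective number of words used for training (after unknown words and other filtering are discounted) and the raw number of words seen across all epochs. Here is a quick reading of the numbers above, assuming the model ran with gensim's default of 5 epochs, since `epochs` was not set when the model was created:

```python
# Values returned by model.train above
effective_words, raw_words = 25435125, 33680420

# Assumption: gensim's default epoch count, since the model did not override it
epochs = 5

raw_per_epoch = raw_words // epochs          # raw tokens in one pass over the corpus
kept_fraction = effective_words / raw_words  # share of tokens that contributed to training

print(raw_per_epoch, round(kept_fraction, 3))
```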

In [16]:
model.wv.most_similar("bad")

[('debts', 0.6417278051376343),
 ('losses', 0.6056818962097168),
 ('borrower', 0.6048856973648071),
 ('debtors', 0.6019881367683411),
 ('coins', 0.5979039669036865),
 ('solvency', 0.5964409708976746),
 ('derivatives', 0.5940613150596619),
 ('pooling', 0.5937293767929077),
 ('penalties', 0.5853939652442932),
 ('signs', 0.5823273062705994)]
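
The scores returned by `most_similar` are cosine similarities between word vectors. Below is a minimal NumPy sketch of that measure on made-up 3-dimensional vectors; the vectors in the trained model would have 100 dimensions under gensim's default `vector_size`.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, not taken from the trained model
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

print(cosine_similarity(a, b), cosine_similarity(a, c))
```

With the trained model, `model.wv.similarity('bad', 'debts')` should reproduce the first score in the list above.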