# Patent Classifier


This notebook prototypes the patent classifier code. It starts with a first exploration of the data sets from the USPTO PatentsView database.

# Data

The data consists of three tab-separated text files from the USPTO PatentsView database. The data sets are relatively large. The largest of the three, containing the claims data, is about 40 GB. The next largest contains the data about each patent, including the title, abstract and date, and takes up about 6 GB. The third and smallest contains the CPC classification info for each patent and takes up about 4 GB. The differences in size are due both to the number of rows and to the amount of text data in each element.

Each of the data sets contains data needed for the patent classification, but not all of the data from each one is needed; only a subset is used in the classification. To simplify the task, a new, smaller data set is created that holds only the relevant information. This partitioned data set is generated in this notebook with the Scala code for Spark below.




In [1]:
val cpc = spark.read.format("csv")
  .option("sep", "\t")
  .option("header", "true")
  .load("../Projects/data/cpc_current.tsv")

val patent = spark.read.format("csv")
  .option("sep", "\t")
  .option("header", "true")
  .load("../Projects/data/patent.tsv")

val claim = spark.read.format("csv")
  .option("sep", "\t")
  .option("header", "true")
  .load("../Projects/data/claim.tsv")

org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/alex/Projects/data/cpc_current.tsv;

In [2]:
println(cpc.count())
println(patent.count())
println(claim.count())

39915464
7144430
101535754


In [3]:
cpc.show(5)

+--------------------+---------+----------+-------------+--------+--------------+-----------+--------+
|                uuid|patent_id|section_id|subsection_id|group_id|   subgroup_id|   category|sequence|
+--------------------+---------+----------+-------------+--------+--------------+-----------+--------+
|000016xombd5lbk9l...|  7070831|         H|          H01|    H01L|H01L2924/01013| additional|      22|
|000070runw99gxjki...|  7618693|         C|          C09|    C09D|     C09D11/30|inventional|       1|
|00008erwm5297s6wv...|  8488869|         G|          G06|    G06T|G06T2207/10016| additional|      20|
|00008q01v2ziacpr0...|  9976665|         A|          A61|    A61M|   A61M5/16886|inventional|       4|
|00008rwbcfjb44c0m...|  9448251|         H|          H01|    H01L|     H01L29/84|inventional|       6|
+--------------------+---------+----------+-------------+--------+--------------+-----------+--------+
only showing top 5 rows
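Because the files are loaded without a schema, every column, including `sequence`, is read as a string. The following is a minimal sketch of reading the CPC file with an explicit schema so that `sequence` is an integer. The column names come from the `cpc.show(5)` output above; the helper name `readCpc`, the temporary sample file and its single row (with a placeholder uuid) are assumptions made for illustration. `patent_id` is kept as a string on the assumption that not every id is purely numeric.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// In spark-shell or a notebook this returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("cpc-schema").getOrCreate()

// Column names taken from the cpc.show(5) output; only `sequence` is typed as an integer.
val cpcSchema = StructType(Seq(
  StructField("uuid", StringType),
  StructField("patent_id", StringType),
  StructField("section_id", StringType),
  StructField("subsection_id", StringType),
  StructField("group_id", StringType),
  StructField("subgroup_id", StringType),
  StructField("category", StringType),
  StructField("sequence", IntegerType)
))

// Hypothetical helper: read a tab-separated CPC file using the schema above.
def readCpc(path: String): DataFrame =
  spark.read
    .option("sep", "\t")
    .option("header", "true")
    .schema(cpcSchema)
    .csv(path)

// Tiny made-up sample file so the sketch runs without the real 4 GB file.
val sample = Files.createTempFile("cpc_sample", ".tsv")
val header = "uuid\tpatent_id\tsection_id\tsubsection_id\tgroup_id\tsubgroup_id\tcategory\tsequence\n"
val row = "example-uuid\t7070831\tH\tH01\tH01L\tH01L2924/01013\tadditional\t22\n"
Files.write(sample, (header + row).getBytes("UTF-8"))

val cpcTyped = readCpc(sample.toString)
cpcTyped.printSchema()
```

With a typed column, filters such as `col("sequence") === 0` compare integers directly instead of relying on an implicit cast from string.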



In [4]:
patent.show(5)

+--------+-------+--------+-------+----------+--------------------+--------------------+----+----------+-------------+---------+
|      id|   type|  number|country|      date|            abstract|               title|kind|num_claims|     filename|withdrawn|
+--------+-------+--------+-------+----------+--------------------+--------------------+----+----------+-------------+---------+
|10000000|utility|10000000|     US|2018-06-19|A frequency modul...|Coherent LADAR us...|  B2|        20|ipg180619.xml|     NULL|
|10000001|utility|10000001|     US|2018-06-19|The injection mol...|Injection molding...|  B2|        12|ipg180619.xml|     NULL|
|10000002|utility|10000002|     US|2018-06-19|The present inven...|Method for manufa...|  B2|         9|ipg180619.xml|     NULL|
|10000003|utility|10000003|     US|2018-06-19|The invention rel...|Method for produc...|  B2|        18|ipg180619.xml|     NULL|
|10000004|utility|10000004|     US|2018-06-19|The present inven...|Process of obtain...|  B2|    

In [5]:
claim.show(5)

+--------------------+---------+--------------------+---------+--------+---------+
|                uuid|patent_id|                text|dependent|sequence|exemplary|
+--------------------+---------+--------------------+---------+--------+---------+
|00000dv6xkiuyewi5...|  4968079|A golf ball retri...|       -1|       1|     True|
|00000w0pl9vz7nts0...|  8266944|4. The method of ...|        3|       4|    False|
|00000yv19kqb063az...|  6992283|77. A mass spectr...|       15|      77|    False|
|000021tixo539g81a...|  8745515|3. The method acc...|        1|       3|    False|
|00002oe7jg97rmmep...|  4149148|The apparatus of ...|       14|      15|    False|
+--------------------+---------+--------------------+---------+--------+---------+
only showing top 5 rows



The columns needed are the patent id, the CPC group_id, the date, and the claim text. Repeated patent ids also have to be removed. This is done by concatenating all claims of a patent into one element and keeping only the CPC row with sequence 0, so that each row corresponds to a unique patent id number.




In [7]:
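import org.apache.spark.sql.functions.{collect_list, concat_ws}

// One row per patent: all claim texts joined into a single comma-separated string.
// Note: collect_list does not guarantee the order of the collected claims.
val grouped_claim = claim.select("patent_id", "text")
    .groupBy("patent_id")
    .agg(concat_ws(",", collect_list("text")).alias("claims"))

// Attach the CPC row with sequence 0 to each patent and keep only the needed columns.
val patent_cpc = patent
    .join(cpc, cpc("patent_id") === patent("id"))
    .filter(cpc("sequence") === 0)
    .select("id", "group_id", "date")

// Combine the claims with the CPC group and date; "id" remains as the patent key.
val df = grouped_claim
    .join(patent_cpc, patent_cpc("id") === grouped_claim("patent_id"))
    .drop("patent_id")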
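`collect_list` makes no promise about the order in which claims are collected, which is probably why some rows in the `df.show()` output of this notebook begin with claim 8 or 14. If claim order matters for the classifier, one option is to sort by the `sequence` column before concatenating. The sketch below is one way to do that, not the approach used in this notebook; the three sample rows and the names `orderedClaims`, `seq`, `c` and `sorted` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// In spark-shell or a notebook this returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("ordered-claims").getOrCreate()
import spark.implicits._

// Made-up sample: three claims of one patent, deliberately out of order.
// `sequence` is a string here, as it is when the TSV is read without a schema.
val claim = Seq(
  ("p1", "second", "2"),
  ("p1", "first", "1"),
  ("p1", "tenth", "10")
).toDF("patent_id", "text", "sequence")

val orderedClaims = claim
  // Pair each claim with its numeric sequence; the cast keeps 10 after 2.
  .select(col("patent_id"),
          struct(col("sequence").cast("int").alias("seq"), col("text")).alias("c"))
  .groupBy("patent_id")
  // sort_array orders structs by their first field (seq), then by text.
  .agg(sort_array(collect_list(col("c"))).alias("sorted"))
  // "sorted.text" extracts the text field of every struct as an array of strings.
  .select(col("patent_id"), concat_ws(",", col("sorted.text")).alias("claims"))

orderedClaims.show(false)
```

The cast to `int` matters: sorting the raw strings would place "10" before "2".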


In [8]:
df.show()

+--------------------+--------+--------+----------+
|              claims|      id|group_id|      date|
+--------------------+--------+--------+----------+
|8. The refrigerat...|10000108|    B60H|2018-06-19|
|14. The method of...|10000172|    B60R|2018-06-19|
|11. The method of...|10000304|    B65B|2018-06-19|
|13. The compound ...|10000454|    C07D|2018-06-19|
|13. The photochro...|10000472|    C07D|2018-06-19|
|5. The method acc...|10000528|    C07K|2018-06-19|
|3. The process ac...|10000591|    C08F|2018-06-19|
|10. The assembly ...|10000670|    C09J|2018-06-19|
|8. A method for r...|10000720|    C10M|2018-06-19|
|14. The system of...|10000723|    C11B|2018-06-19|
|5. The recombinan...|10000761|    C12N|2018-06-19|
|3. The method of ...|10000835|    C22F|2018-06-19|
|14. The composite...|10000989|    E21B|2018-06-19|
|6. The magazine l...|10001331|    F41A|2018-06-19|
|14. The method of...|10001922|    G06F|2018-06-19|
|1. A method for v...|10001989|    G06F|2018-06-19|
|3. The syst

Here we save the partitioned data. We also count the number of rows in the cpc file whose sequence value is zero.




In [10]:
// save() returns Unit, so there is no result worth assigning to a val.
df.write.save("../Projects/data/partitioned_hdfs")
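Calling `save` without an explicit format uses the session's default data source, which is Parquet unless `spark.sql.sources.default` has been changed. It also fails if the target path already exists. A minimal sketch of a more explicit write, followed by reading the result back, might look like the following. The temporary output directory and the one-row sample (with placeholder claim text) are assumptions so that the example runs on its own.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// In spark-shell or a notebook this returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("save-roundtrip").getOrCreate()
import spark.implicits._

// One-row stand-in for df, with the same column names as the df.show() output.
val small = Seq(("claim text", "10000108", "B60H", "2018-06-19"))
  .toDF("claims", "id", "group_id", "date")

// Hypothetical output location inside a fresh temporary directory.
val outDir = Files.createTempDirectory("partitioned_demo").resolve("out").toString

small.write
  .mode(SaveMode.Overwrite)   // replace earlier output instead of failing
  .format("parquet")          // state the format rather than rely on the default
  .save(outDir)

// Read the saved data back to confirm that the columns survived the round trip.
val restored = spark.read.parquet(outDir)
restored.show()
```

Parquet stores the column names and types with the data, so no separator or header options are needed when reading it back.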

In [11]:
cpc.filter(col("sequence") === "0").count()

6452079
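This count (6,452,079) is lower than the number of rows in the patent file (7,144,430), which suggests that some patents have no CPC row with sequence 0 and therefore drop out of the inner join. The sketch below shows one way to check both that no patent has more than one sequence-0 row and which patents have none. The sample rows reuse ids from the outputs above, but the `sequence` values assigned to them here, and the names `primary`, `duplicated` and `missing`, are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell or a notebook this returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("cpc-checks").getOrCreate()
import spark.implicits._

// Made-up stand-ins for the cpc and patent data sets (sequence kept as a string).
val cpc = Seq(
  ("7070831", "H01L", "0"),
  ("7070831", "H01L", "1"),
  ("7618693", "C09D", "0"),
  ("8488869", "G06T", "20")
).toDF("patent_id", "group_id", "sequence")

val patent = Seq("7070831", "7618693", "8488869").toDF("id")

// CPC rows with sequence 0, the same filter as in the cell above.
val primary = cpc.filter(col("sequence") === "0")

// Patents with more than one sequence-0 row would produce duplicate rows after the join.
val duplicated = primary.groupBy("patent_id").count().filter(col("count") > 1)

// Patents without any sequence-0 row are silently dropped by an inner join.
val missing = patent.join(primary, patent("id") === primary("patent_id"), "left_anti")

println(s"duplicated: ${duplicated.count()}, missing: ${missing.count()}")
```

On the real data, a non-empty `missing` result would explain the gap between the two counts.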