# Trademark Case Files Dataset

This dataset is found at https://www.uspto.gov/learning-and-resources/electronic-data-products/trademark-case-files-dataset-0.

The Trademark Case Files Dataset contains detailed information on 8.6 million trademark applications filed with or registrations issued by the USPTO between January 1870 and January 2017. It is derived from the USPTO main database for administering trademarks and includes data on mark characteristics, prosecution events, ownership, classification, third-party oppositions, and renewal history.

Full schema can be found at: https://www.uspto.gov/sites/default/files/documents/casefiles_schema_high_level_2016update.pdf

Data dictionary found at: https://www.uspto.gov/sites/default/files/documents/vartable_2016.pdf

For this analysis we'll use only the case files, owners (and the related owner change table), and events tables to explore existing and past trademark applications and registrations, the parties that own them, and the history of administrative actions taken on them from beginning to end.

More details on the full dataset can be found in this publication from the US Patent and Trademark Office: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2188621

### Load our packages

In [1]:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

In [2]:
import sqlContext.implicits._

In [3]:
import org.apache.spark.sql.types._

In [4]:
import org.apache.spark.sql.functions._

### Load our data
#### Case files data: 
Table which describes the current status of all trademark applications from 1870 to 2017, as of the last data update (January 2017).

In [5]:
val case_files_data = spark.read.format("csv").
                      option("header", "true").
                      option("quote", "\"").
                      option("escape", "\"").
                      option("mode", "DROPMALFORMED").
                      option("inferSchema", "true").
                      load("hdfs://sandbox.hortonworks.com:8020/tmp/case_file.csv")

The case files data is quite large (over 8,000,000 records) and wide (79 variables). Let's grab just the columns we're interested in after reviewing the data dictionary here (https://www.uspto.gov/sites/default/files/documents/vartable_2016.pdf).

In [6]:
val case_files_cols = Seq("serial_no",
"abandon_dt",
"amend_reg_dt",
"reg_cancel_cd",
"reg_cancel_dt",
"cancel_pend_in",
"cert_mark_in",
"chg_reg_in",
"coll_memb_mark_in",
"coll_serv_mark_in",
"coll_trade_mark_in",
"serv_mark_in",
"draw_color_cur_in",
"draw_color_file_in",
"concur_use_in",
"concur_use_pend_in",
"filing_dt",
"for_priority_in",
"lb_itu_cur_in",
"lb_itu_file_in",
"interfer_pend_in",
"exm_office_cd",
"file_location_dt",
"mark_draw_cd",
"mark_id_char",
"opposit_pend_in",
"amend_principal_in",
"concur_use_pub_in",
"publication_dt",
"registration_dt",
"renewal_dt",
"renewal_file_in",
"cfh_status_cd",
"cfh_status_dt",
"trade_mark_in",
"registration_no")

In [7]:
val case_files_lite = case_files_data.select(case_files_cols.head, case_files_cols.tail: _*)

Next, since this covers such a wide range of time (almost 150 years of trademark registrations!), we want to look at trends over the years. We'll create year columns for all of the date columns, since we'll access them a lot and don't want to recompute them from the date columns every time.

In [8]:
val case_files_lite_years = case_files_lite.
                            withColumn("filing_yr",year(case_files_lite("filing_dt"))).
                            withColumn("abandon_yr",year(case_files_lite("abandon_dt"))).
                            withColumn("publication_yr",year(case_files_lite("publication_dt"))).
                            withColumn("registration_yr",year(case_files_lite("registration_dt"))).
                            withColumn("cfh_status_yr",year(case_files_lite("cfh_status_dt"))).
                            withColumn("renewal_yr",year(case_files_lite("renewal_dt"))).
                            withColumn("reg_cancel_yr",year(case_files_lite("reg_cancel_dt")))
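
The same derivation can also be written as a fold over the list of date columns, so that adding another date column is a one-word change. This is only a minimal sketch: the local `SparkSession`, the toy rows, and the helper name `addYearCols` are assumptions made for illustration and are not part of the original notebook.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_date, year}

// Hypothetical local session; in the notebook `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("year-cols-sketch").getOrCreate()

// Toy rows (made up), with one missing registration date.
val toy = spark.createDataFrame(Seq[(String, String, String)](
  ("70000001", "1999-03-15", "2001-06-01"),
  ("70000002", "2005-11-30", null)
)).toDF("serial_no", "filing_dt", "registration_dt")

// For every "<name>_dt" column, add a "<name>_yr" column holding its year.
def addYearCols(df: DataFrame, dateCols: Seq[String]): DataFrame =
  dateCols.foldLeft(df) { (acc, c) =>
    acc.withColumn(c.stripSuffix("_dt") + "_yr", year(to_date(col(c))))
  }

val withYears = addYearCols(toy, Seq("filing_dt", "registration_dt"))
withYears.show()
```

A missing date simply yields a null year, which is why the later queries filter on `filing_yr` being non-null.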

And let's then write out the derivative to a parquet file so we don't need to do this every time.

In [9]:
// Overwrite any earlier output so that re-running this cell does not fail with
// "AnalysisException: path ... already exists".
case_files_lite_years.write.mode("overwrite").parquet("hdfs://sandbox.hortonworks.com:8020/tmp/case_files_processed.parquet")

Then let's load it back in for faster traversal of the case files data

In [10]:
val case_files = sqlContext.read.parquet("hdfs://sandbox.hortonworks.com:8020/tmp/case_files_processed.parquet")

#### Events data:
Table which describes the event history of each trademark application from initiation through registration (or denial) and includes renewals, abandonments, and expiration.

In [None]:
val events_raw = spark.read.format("csv").
                 option("header", "true").
                 option("quote", "\"").
                 option("escape", "\"").
                 option("mode", "DROPMALFORMED").
                 option("inferSchema", "true").
                 load("hdfs://sandbox.hortonworks.com:8020/tmp/event.csv")

#### Owners and owner change data:
Two tables that track the owning parties and the history of changes in ownership of the trademarks in the case files table. Owners are related to the case files table through the serial number.

In [None]:
val owners_raw = spark.read.format("csv").
                 option("header", "true").
                 option("quote", "\"").
                 option("escape", "\"").
                 option("mode", "DROPMALFORMED").
                 option("inferSchema", "true").
                 load("hdfs://sandbox.hortonworks.com:8020/tmp/owner.csv")

### Question 1: Trends in types of trademarks (service mark, trademark, certification mark, or collective mark)
- Ref: https://www.bitlaw.com/source/tmep/1306_01.html
- Check mark type for trends in service mark registrations like the dotcom boom
- Can we see an increase in trademark applications in general, like the one during the 1990s dotcom boom?

In [7]:
val service_marks = case_files.filter($"serv_mark_in" === 1)

In [8]:
service_marks.count() //this is about a third of the total observations, as mentioned in the publication

2892932

In [9]:
val service_counts = service_marks.filter($"filing_yr".isNotNull).groupBy("filing_yr").agg(count("*") as "year_count").orderBy($"filing_yr" asc)

In [10]:
service_counts.show(150)

+---------+----------+
|filing_yr|year_count|
+---------+----------+
|     1899|         1|
|     1911|         1|
|     1915|         1|
|     1931|         3|
|     1932|         3|
|     1933|         1|
|     1934|         1|
|     1935|         1|
|     1937|         1|
|     1938|         2|
|     1939|         3|
|     1944|         1|
|     1946|         4|
|     1947|       170|
|     1948|       142|
|     1949|        87|
|     1950|        86|
|     1951|        82|
|     1952|       115|
|     1953|       140|
|     1954|       136|
|     1955|       145|
|     1956|       142|
|     1957|       143|
|     1958|       160|
|     1959|       221|
|     1960|       311|
|     1961|       582|
|     1962|       639|
|     1963|       550|
|     1964|       719|
|     1965|       864|
|     1966|      1025|
|     1967|      1146|
|     1968|      1444|
|     1969|      1916|
|     1970|      1875|
|     1971|      1754|
|     1972|      2048|
|     1973|      2131|
|     1974|

Pretty interesting. The original authors noticed a spike in service marks leading up to the dotcom boom, with a noticeable drop-off from 2001 through 2004. Since 2005, service mark filings have risen steadily again, dipping only slightly after the economic crash of October 2008 before climbing to historic highs in 2015 and 2016.
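
To put numbers on rises and dips like these, one option is a year-over-year percentage change computed with a window function. The sketch below is illustrative only: it assumes a table shaped like `service_counts` (`filing_yr`, `year_count`), and the session and the counts in `toyCounts` are made up, not taken from the dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

// Hypothetical local session; in the notebook `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("yoy-sketch").getOrCreate()

// Made-up yearly counts with the same columns as service_counts.
val toyCounts = spark.createDataFrame(Seq(
  (2000, 100L),
  (2001, 80L),
  (2002, 90L)
)).toDF("filing_yr", "year_count")

// An unpartitioned window pulls all rows into one partition, which is
// acceptable here because there is only one row per year.
val byYear = Window.orderBy("filing_yr")

val yoy = toyCounts.
  withColumn("prev_count", lag(col("year_count"), 1).over(byYear)).
  withColumn("pct_change", (col("year_count") - col("prev_count")) / col("prev_count") * 100).
  orderBy("filing_yr")

yoy.show()
```

The first year has no predecessor, so its `pct_change` is null rather than zero.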

In [11]:
//lets save that table out to csv so we can visualize later
service_counts.write.csv("hdfs://sandbox.hortonworks.com:8020/tmp/output/service_counts.csv")

### Question 2: Getting a little more complicated and looking at the bigger picture of which trademarks are opposed vs. those that are published.
- Do oppositions increase over time as more and more trademarks are registered, or do they stay pretty stable?
    - Service mark, trademark, certification mark (?) - only reason to join to case_file table
    - Bucket and aggregate by year
    - When were oppositions generated?
    - When were most oppositions sustained?


Events are logged in the event table. Of the 799 different event codes (see https://eipweb.uspto.gov/TrademarkCaseFileEconomics/2011/event_description.csv.zip), we want to examine those involving oppositions: after the examiners accept a trademark, it is "Published for opposition" so that third parties can file a claim challenging a mark that is still in the process of being registered. We'll look at six event codes from the event_description file:


| event_code | event_type | event_desc                                     | event_count |
|------------|------------|------------------------------------------------|-------------|
| PUBO       | A          | PUBLISHED FOR OPPOSITION                       | 5399155     |
| R.PR       | A          | REGISTERED-PRINCIPAL REGISTER                  | 3852684     |
| OP.I       | T          | OPPOSITION INSTITUTED NO. 999999               | 147676      |
| OP.T       | T          | OPPOSITION TERMINATED NO. 999999               | 144702      |
| OP.D       | T          | OPPOSITION DISMISSED NO. 999999                | 78278       |
| OP.S       | T          | OPPOSITION SUSTAINED NO. 999999                | 61363       |

After **PUBLISHED FOR OPPOSITION**, the next major event, indicating that an opposition has actually been made, is **OPPOSITION INSTITUTED**. Of that subset, the two major outcomes are either **OPPOSITION DISMISSED** (and it can go on to be registered) or **OPPOSITION SUSTAINED** (and it will go on to be abandoned).

*Note*: Because there are so many different event codes and so many different ways for an application to move through the trademark application process, there are events we are not accounting for in this analysis. With more time and a better understanding of the trademark process we'd be able to address those additional factors. But since this analysis focuses on the majority of outcomes related to opposing a trademark registration, it is at least somewhat informative.

**And so we ask**: Over time, when were instituted oppositions most prevalent, and when were more or fewer oppositions sustained (relative to the number of applications published for opposition and registered on the principal register each year)?
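
One possible shape for this query is a conditional aggregation over the event codes listed above, bucketed by year. This is a hedged sketch rather than the analysis itself: the column names `event_cd` and `event_dt`, the helper `countOf`, the local session, and the toy rows are all assumptions for illustration and should be checked against the data dictionary.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, sum, to_date, when, year}

// Hypothetical local session; in the notebook `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("opposition-sketch").getOrCreate()

// Made-up events; column names are assumed, not confirmed by the notebook.
val toyEvents = spark.createDataFrame(Seq(
  ("1", "PUBO", "2001-01-02"),
  ("1", "OP.I", "2001-02-10"),
  ("1", "OP.S", "2002-05-01"),
  ("2", "PUBO", "2001-03-04"),
  ("3", "PUBO", "2002-07-08")
)).toDF("serial_no", "event_cd", "event_dt")

// Count the rows carrying one event code, under a readable column name.
def countOf(code: String, name: String): Column =
  sum(when(col("event_cd") === code, 1).otherwise(0)).as(name)

val oppositionsByYear = toyEvents.
  withColumn("event_yr", year(to_date(col("event_dt")))).
  groupBy("event_yr").
  agg(countOf("PUBO", "published"), countOf("OP.I", "instituted"), countOf("OP.S", "sustained")).
  withColumn("instituted_per_published", col("instituted") / col("published")).
  orderBy("event_yr")

oppositionsByYear.show()
```

Note that this buckets each event by the year it occurred, so an opposition sustained in one year may belong to an application published in an earlier one; relating outcomes back to the publication year would need a join on `serial_no`.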

Since this is a little more complicated, in that it involves some table joins and more computation, how do we want to do this without making our infrastructure suffer?

- **Hive** (No. Just use parquet, since this is a one-time analysis and we just want to pinpoint specific columns to answer a few simple questions. If we intended to do longitudinal, regularly updated analyses that depend on reliable ACID transactions, we might create a pipeline into Hive for persistence. For now, parquet should suffice for improving query performance on a quasi-static dataset.)
- **Broadcast variables?** 
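
On the broadcast question, one related option (not the same thing as an explicit broadcast variable) is Spark's broadcast join hint: if the opposition events are filtered down to a small table first, that table can be shipped to every executor instead of shuffling the large case files table. The sketch below only illustrates the API; the session, the toy rows, and the column name `event_cd` are assumptions, and whether the filtered events are small enough to broadcast has not been verified here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical local session; in the notebook `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("broadcast-sketch").getOrCreate()

// Stand-in for the (large) case files table.
val toyCases = spark.createDataFrame(Seq(
  ("1", 1, 0),
  ("2", 0, 1)
)).toDF("serial_no", "serv_mark_in", "trade_mark_in")

// Stand-in for the (small) set of opposition events after filtering.
val toyOppositions = spark.createDataFrame(Seq(
  ("1", "OP.I")
)).toDF("serial_no", "event_cd")

// broadcast() is only a hint asking Spark to copy the small side to each executor.
val joined = toyCases.join(broadcast(toyOppositions), Seq("serial_no"), "inner")

joined.show()
```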

### Question 3: Owners
- Who has the most trademarks?
- What are some trends in ownership over time?
    - see owner table and join with case_files

In [2]:
// Requires the imports from "Load our packages" and the owners_raw load cell to have
// been run in this session; in a fresh kernel this cell fails to compile with
// "not found: value owners_raw" (and likewise for count and $).
val uown = owners_raw.groupBy("own_type_cd").agg(count("*") as "own_count").orderBy($"own_count".desc)
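
For the "who has the most trademarks" question, a starting point could be a distinct count of serial numbers per owner name. This is a rough sketch under assumptions: the column name `own_name` is assumed rather than confirmed by the notebook, the session and rows are made up, and a distinct count is used in case the owner table holds more than one row per serial number. Real owner names would also need normalizing (case, punctuation, corporate suffixes) before the counts mean much.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct}

// Hypothetical local session; in the notebook `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("owners-sketch").getOrCreate()

// Made-up owner rows; serial 2 appears twice to mimic repeated entries.
val toyOwners = spark.createDataFrame(Seq(
  ("1", "ACME CORP"),
  ("2", "ACME CORP"),
  ("2", "ACME CORP"),
  ("3", "GLOBEX INC")
)).toDF("serial_no", "own_name")

val topOwners = toyOwners.
  groupBy("own_name").
  agg(countDistinct("serial_no").as("n_marks")).
  orderBy(col("n_marks").desc).
  limit(10)

topOwners.show()
```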

### Question 4: length of marks over time

In [12]:
val case_mark_lengths = case_files.filter($"filing_yr".isNotNull && $"mark_id_char".isNotNull).select("mark_id_char", "filing_yr").withColumn("count", size(split($"mark_id_char", " "))).orderBy($"filing_yr" asc)

In [13]:
val avg_mark_counts = case_mark_lengths.groupBy("filing_yr").agg(mean("count")).orderBy($"filing_yr" asc)
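
One caveat with splitting on a single space is that runs of spaces produce empty tokens, which inflates the word count. The sketch below contrasts that with splitting the trimmed text on any run of whitespace, and also shows character length as an alternative measure of mark length. The local session and the toy mark are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length, size, split, trim}

// Hypothetical local session; in the notebook `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("mark-length-sketch").getOrCreate()

// A made-up mark containing a double space.
val toyMarks = spark.createDataFrame(Seq(
  ("JUST  DO IT", 1988)
)).toDF("mark_id_char", "filing_yr")

val lengths = toyMarks.
  withColumn("words_single_space", size(split(col("mark_id_char"), " "))).
  withColumn("words_whitespace", size(split(trim(col("mark_id_char")), "\\s+"))).
  withColumn("chars", length(col("mark_id_char")))

lengths.show()
```

Whether this matters in practice depends on how `mark_id_char` is normalized in the source data, which has not been checked here.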

In [14]:
// Overwrite any earlier output so that re-running this cell does not fail with
// "AnalysisException: path ... already exists".
avg_mark_counts.write.mode("overwrite").csv("hdfs://sandbox.hortonworks.com:8020/tmp/output/avg_mark_counts.csv")