# Trademark Case Files Dataset

This dataset is found at https://www.uspto.gov/learning-and-resources/electronic-data-products/trademark-case-files-dataset-0.

The Trademark Case Files Dataset contains detailed information on 8.6 million trademark applications filed with or registrations issued by the USPTO between January 1870 and January 2017. It is derived from the USPTO main database for administering trademarks and includes data on mark characteristics, prosecution events, ownership, classification, third-party oppositions, and renewal history.

Full schema can be found at: https://www.uspto.gov/sites/default/files/documents/casefiles_schema_high_level_2016update.pdf

Data dictionary found at: https://www.uspto.gov/sites/default/files/documents/vartable_2016.pdf

For this analysis we'll use only the case files and events tables to explore existing and past trademark applications and registrations, along with the history of administrative actions taken on each one from beginning to end.

More details on the full dataset can be found in this publication from the US Patent and Trademark Office: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2188621

### Full Dataset Schema

<img src="./imgs/case_files_data_schema.png" width="75%" height="75%"/>

Of the 14 data tables in the dataset, we'll only use the **case_file** and **event** tables. They are of sufficient size and complexity that this should be enough for a first pass at this dataset.

### Load our packages

In [2]:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

In [3]:
import sqlContext.implicits._

In [4]:
import org.apache.spark.sql.types._

In [5]:
import org.apache.spark.sql.functions._

### Load our data
#### Case files data:
This table describes the current status of all trademark applications from 1870 to 2017, as of the last data update (January 2017).

In [None]:
val case_files_raw = spark.read.format("csv").option("header", "true").option("quote", "\"").option("escape", "\"").option("mode", "DROPMALFORMED").option("inferSchema", "true").load("hdfs://sandbox.hortonworks.com:8020/tmp/case_file.csv")
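The `quote` and `escape` options above are both set to `"`, which tells the CSV reader that a literal quote inside a quoted field is written as a doubled quote. Here is a minimal, Spark-free sketch of that convention. `parseCsvLine` and the sample row are hypothetical illustrations, not part of Spark or the USPTO data, and the sketch ignores edge cases such as embedded newlines.

```scala
// Split one CSV line into fields, treating "" inside a quoted field as one literal quote.
def parseCsvLine(line: String): Vector[String] = {
  val fields = Vector.newBuilder[String]
  val current = new StringBuilder
  var inQuotes = false
  var i = 0
  while (i < line.length) {
    val c = line.charAt(i)
    if (inQuotes) {
      if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
        current.append('"') // doubled quote becomes one literal quote
        i += 1              // skip the second quote of the pair
      } else if (c == '"') {
        inQuotes = false    // closing quote
      } else {
        current.append(c)   // commas inside quotes are ordinary characters
      }
    } else if (c == '"') {
      inQuotes = true       // opening quote
    } else if (c == ',') {
      fields += current.toString
      current.clear()
    } else {
      current.append(c)
    }
    i += 1
  }
  fields += current.toString
  fields.result()
}

// Invented sample row: serial number, mark text, flag.
val sampleRow = "75000001,\"ACME \"\"BEST\"\" WIDGETS, INC.\",1"
val sampleFields = parseCsvLine(sampleRow)
```

Rows that break this convention are the kind that `DROPMALFORMED` mode discards silently, so it can be worth comparing the loaded row count against the expected total.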

The case files data is quite large (over 8,000,000 records) and wide (79 variables). Let's grab just the columns we're interested in after reviewing the data dictionary (https://www.uspto.gov/sites/default/files/documents/vartable_2016.pdf).

In [None]:
val case_files_cols = Seq("serial_no",
"abandon_dt",
"amend_reg_dt",
"reg_cancel_cd",
"reg_cancel_dt",
"cancel_pend_in",
"cert_mark_in",
"chg_reg_in",
"coll_memb_mark_in",
"coll_serv_mark_in",
"coll_trade_mark_in",
"serv_mark_in",
"draw_color_cur_in",
"draw_color_file_in",
"concur_use_in",
"concur_use_pend_in",
"filing_dt",
"for_priority_in",
"lb_itu_cur_in",
"lb_itu_file_in",
"interfer_pend_in",
"exm_office_cd",
"file_location_dt",
"mark_draw_cd",
"mark_id_char",
"opposit_pend_in",
"amend_principal_in",
"concur_use_pub_in",
"publication_dt",
"registration_dt",
"renewal_dt",
"renewal_file_in",
"cfh_status_cd",
"cfh_status_dt",
"trade_mark_in",
"registration_no")

In [None]:
val case_files_lite = case_files_raw.select(case_files_cols.head, case_files_cols.tail: _*)

Next, since this covers such a wide range of time (almost 150 years of trademark registrations!), we want to look at trends over the years. We'll create year columns for all of the date columns, since we'll use them a lot and don't want to recompute them from the date columns each time.
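As a rough, Spark-free sketch of what a year column holds, the snippet below derives a year from a date string and treats anything unparseable as missing. It assumes ISO `yyyy-MM-dd` strings, which may not match the raw CSV exactly; `yearOf` is a hypothetical helper, not a Spark function.

```scala
import java.time.LocalDate
import scala.util.Try

// Assumption: dates arrive as ISO "yyyy-MM-dd" strings; anything else counts as missing.
def yearOf(date: String): Option[Int] =
  Try(LocalDate.parse(date.trim).getYear).toOption

val filingYear  = yearOf("1999-03-15") // Some(1999)
val missingYear = yearOf("")           // None
```

Many date columns (abandonment, cancellation, renewal) are empty for most cases, so the derived year columns will be mostly missing and should be filtered before grouping by year.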

And while we are preparing some of the variables for easier analysis later, let's put the indicator for what type of mark each case is (certification, service, trademark, or collective) into one column. In the original dataset it is spread across multiple columns, with a boolean value in each.

In [None]:
val case_files_lite_years_types = case_files_lite.
                            withColumn("filing_yr",year(case_files_lite("filing_dt"))).
                            withColumn("abandon_yr",year(case_files_lite("abandon_dt"))).
                            withColumn("publication_yr",year(case_files_lite("publication_dt"))).
                            withColumn("registration_yr",year(case_files_lite("registration_dt"))).
                            withColumn("cfh_status_yr",year(case_files_lite("cfh_status_dt"))).
                            withColumn("renewal_yr",year(case_files_lite("renewal_dt"))).
                            withColumn("reg_cancel_yr",year(case_files_lite("reg_cancel_dt"))).
                            withColumn("mark_type", when($"cert_mark_in" === 1, "certification").
                                                    when($"serv_mark_in" === 1, "service").
                                                    when($"trade_mark_in" === 1, "trademark").
                                                    otherwise("collective"))
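The `when` chain above is order-sensitive: the first flag that matches decides the type. A plain-Scala sketch of the same precedence, where `markType` is a hypothetical helper and the flags are assumed to be 0/1 integers:

```scala
// Mirrors the order of the when(...) chain: the first matching flag wins.
def markType(certMark: Int, servMark: Int, tradeMark: Int): String =
  if (certMark == 1) "certification"
  else if (servMark == 1) "service"
  else if (tradeMark == 1) "trademark"
  else "collective"

val onlyService    = markType(certMark = 0, servMark = 1, tradeMark = 0) // "service"
val certAndService = markType(certMark = 1, servMark = 1, tradeMark = 0) // "certification"
val noFlagSet      = markType(certMark = 0, servMark = 0, tradeMark = 0) // "collective"
```

Note the fallback: any row with none of the three flags set is labelled `collective`, including rows whose flags are missing, so that category may be broader than the true collective marks.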

In [None]:
case_files_lite_years_types.filter($"mark_type"==="trademark").count()

Then let's write the derived table out to a parquet file so we don't need to do this every time.

In [None]:
case_files_lite_years_types.write.parquet("hdfs://sandbox.hortonworks.com:8020/tmp/case_files_processed.parquet")

Then let's load it back in for faster traversal of the case files data

In [23]:
val case_files = sqlContext.read.parquet("hdfs://sandbox.hortonworks.com:8020/tmp/case_files_processed.parquet")

#### Events data:
This table describes the event history of each trademark application from initiation through registration (or denial), including renewals, abandonments, and expirations.

In [None]:
val events_raw = spark.read.format("csv").option("header", "true").option("quote", "\"").option("escape", "\"").option("mode", "DROPMALFORMED").option("inferSchema", "true").load("hdfs://sandbox.hortonworks.com:8020/tmp/event.csv")

In [None]:
events_raw.columns

### Question 1: Trends in types of trademarks (service mark, trademark, certification mark, or collective mark)
- Ref: https://www.bitlaw.com/source/tmep/1306_01.html
- Check mark type for trends in service mark registrations, such as the dotcom boom
- Can we see a general increase in trademark applications around events like the 1990s dotcom boom?

In [24]:
val service_marks = case_files.filter($"serv_mark_in" === 1)

In [25]:
service_marks.count() //this is about a third of the total observations, as mentioned in the publication

2892932

In [26]:
val certification_marks = case_files.filter($"cert_mark_in" === 1)

In [27]:
certification_marks.count()

11893

In [28]:
val trademarks = case_files.filter($"trade_mark_in" === 1)

In [29]:
trademarks.count()

5154980
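To put the three counts in proportion, here is a small arithmetic sketch using the numbers printed above. The shares are relative to the sum of the three flagged counts, which is an assumption: a case could carry more than one flag or none, so this sum need not equal the number of rows in the table.

```scala
// Counts copied from the cells above.
val typeCounts = Seq(
  ("service", 2892932L),
  ("certification", 11893L),
  ("trademark", 5154980L)
)
val flaggedTotal = typeCounts.map(_._2).sum // 8059805

// Percentage share of each type among the flagged rows, rounded to one decimal.
val typeShares = typeCounts.map { case (name, n) =>
  (name, math.round(n * 1000.0 / flaggedTotal) / 10.0)
}.toMap
```

Service marks come out at roughly 36% of the flagged cases, which is in line with the "about a third" remark above.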

In [30]:
val service_counts = service_marks.filter($"filing_yr".isNotNull).groupBy("filing_yr").agg(count("*") as "serv_count").na.fill(0, Seq("serv_count")).orderBy($"filing_yr" asc)

In [31]:
val certification_counts = certification_marks.filter($"filing_yr".isNotNull).groupBy("filing_yr").agg(count("*") as "cert_count").na.fill(0, Seq("cert_count")).orderBy($"filing_yr" asc)

In [32]:
val trademark_counts = trademarks.filter($"filing_yr".isNotNull).groupBy("filing_yr").agg(count("*") as "trade_count").na.fill(0, Seq("trade_count")).orderBy($"filing_yr" asc)

In [36]:
val mark_type_counts_by_year = service_counts.join(certification_counts, Seq("filing_yr"), "left_outer").join(trademark_counts, Seq("filing_yr"), "left_outer")

In [37]:
val sorted_mark_type_counts_by_year = mark_type_counts_by_year.orderBy($"filing_yr" asc).na.fill(0, Seq("cert_count"))
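One thing to keep in mind about the join above: a left outer join that starts from `service_counts` keeps only the years in which at least one service mark was filed, so years with trademarks but no service marks drop out. The plain-Scala sketch below contrasts that with a full outer merge. The maps are toy data (the 1945 value is invented), and in Spark the corresponding change would presumably be the join type `"outer"`.

```scala
// Toy per-year counts; the 1945 trademark value is invented for illustration.
val servToy  = Map(1947 -> 170L, 1948 -> 142L)
val tradeToy = Map(1945 -> 1500L, 1947 -> 6915L, 1948 -> 5301L)

// Left outer from the service side: only years present in servToy survive.
val leftOuter = servToy.map { case (yr, s) => (yr, (s, tradeToy.getOrElse(yr, 0L))) }

// Full outer: union of the year keys, with missing counts filled with 0.
val fullOuter = (servToy.keySet ++ tradeToy.keySet).toSeq.sorted.map { yr =>
  (yr, servToy.getOrElse(yr, 0L), tradeToy.getOrElse(yr, 0L))
}
```

With a full outer merge, all three count columns can be missing for a given year, so each of them would need the zero fill, not just `cert_count`.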

In [38]:
sorted_mark_type_counts_by_year.show()

+---------+----------+----------+-----------+
|filing_yr|serv_count|cert_count|trade_count|
+---------+----------+----------+-----------+
|     1899|         1|         0|         46|
|     1911|         1|         0|        261|
|     1915|         1|         0|        278|
|     1931|         3|         0|       1023|
|     1932|         3|         0|        856|
|     1933|         1|         0|        975|
|     1934|         1|         0|       1141|
|     1935|         1|         0|       1034|
|     1937|         1|         0|        922|
|     1938|         2|         0|        928|
|     1939|         3|         1|        879|
|     1944|         1|         0|       1687|
|     1946|         4|         3|       2594|
|     1947|       170|         6|       6915|
|     1948|       142|         5|       5301|
|     1949|        87|         5|       3731|
|     1950|        86|         3|       3542

Pretty interesting. The original authors noticed a spike in service marks leading up to the dotcom boom, with a noticeable drop-off from 2001 through 2004. Since 2005, service marks have risen steadily again, dipping only slightly after the economic crash of October 2008 before climbing to historic highs in 2015 and 2016.

Compared to service marks, it's interesting to note that trademarks don't show as noticeable an interruption year over year, except for a slight drop in 2009, which again we might hypothesize was a result of the 2008 crash.

In [39]:
// let's save that table out to csv so we can visualize it later
sorted_mark_type_counts_by_year.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("hdfs://sandbox.hortonworks.com:8020/tmp/output/mark_type_counts_by_year.csv")

### Question 2: Getting a little more in depth and looking at the bigger picture of opposition to trademark applications
- Do oppositions increase over time as more and more trademarks are registered, or do they stay roughly proportional?
    - Service mark, trademark, certification mark
    - Bucket and aggregate by year
    - When were oppositions generated?
    - When were most oppositions sustained?


Events are logged in the event table. Of the 799 different event statuses (see https://eipweb.uspto.gov/TrademarkCaseFileEconomics/2011/event_description.csv.zip), we want to examine those involving oppositions: the challenges third parties can bring after the examiner accepts a trademark and it is "Published for opposition", but before it is registered. We'll look at 6 statuses from the event_description file, 4 of which relate to opposing trademark applications:


| event_code | event_type | event_desc                                     | event_count |
|------------|------------|------------------------------------------------|-------------|
| PUBO       | A          | PUBLISHED FOR OPPOSITION                       | 5399155     |
| R.PR       | A          | REGISTERED-PRINCIPAL REGISTER                  | 3852684     |
| OP.I       | T          | OPPOSITION INSTITUTED NO. 999999               | 147676      |
| OP.T       | T          | OPPOSITION TERMINATED NO. 999999               | 144702      |
| OP.D       | T          | OPPOSITION DISMISSED NO. 999999                | 78278       |
| OP.S       | T          | OPPOSITION SUSTAINED NO. 999999                | 61363       |

After **PUBLISHED FOR OPPOSITION**, the next major event, and the one indicating that an opposition has actually been filed, is **OPPOSITION INSTITUTED**. Within that subset, the two major outcomes are **OPPOSITION DISMISSED** (the mark can go on to be registered) and **OPPOSITION SUSTAINED** (the application will go on to be abandoned). There is a third status, **OPPOSITION TERMINATED**, reached either because the registering party did not respond to the opposition, or because the registering party filed a motion in response and the opposing party failed to respond. More details on trademark opposition proceedings and the Trademark Trial and Appeal Board (TTAB) can be found here: http://www.wipo.int/export/sites/www/sct/en/comments/pdf/sct17/us_1.pdf
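Before breaking things down by year, the `event_count` totals in the table already give a rough sense of scale. The sketch below is plain arithmetic on those totals, where `pct` is a hypothetical helper. These are dataset-wide event counts rather than per-application outcomes, so treat the ratios as approximate.

```scala
// Totals taken from the event_count column of the table above.
val published  = 5399155L // PUBO
val instituted = 147676L  // OP.I
val sustained  = 61363L   // OP.S
val dismissed  = 78278L   // OP.D

def pct(part: Long, whole: Long): Double = part * 100.0 / whole

val institutedPerPublished = pct(instituted, published) // roughly 2.7
val sustainedPerInstituted = pct(sustained, instituted) // roughly 41.6
val dismissedPerInstituted = pct(dismissed, instituted) // roughly 53.0
```

So fewer than 3 in 100 publication events are matched by an instituted opposition, and dismissals outnumber sustained oppositions overall.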

*Note*: Because there are so many event codes and so many paths an application can take through the trademark process, this analysis leaves some events unaccounted for. With more time and a better understanding of the process we could address those additional factors. Since we are focusing on the most common outcomes of opposing a trademark registration, the analysis should still be at least somewhat informative.

**And so we ask**: Over time, when were instituted oppositions most prevalent, and when were more or fewer oppositions sustained (relative to the number of applications published for opposition and registered on the principal register each year)?

**First**, let's reduce the amount of work the cluster has to do by keeping only the event records for the 4 opposition event codes (`OP.I`, `OP.T`, `OP.D`, `OP.S`) plus the 2 baseline codes `PUBO` and `R.PR`.

In [None]:
val filtered_events = events_raw.filter($"event_cd" === "OP.I" || $"event_cd" === "OP.T" || $"event_cd" === "OP.D" || $"event_cd" === "OP.S" || $"event_cd" === "PUBO" || $"event_cd" === "R.PR").withColumn("event_yr", year($"event_dt"))
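The six-way `||` chain above amounts to a set-membership test. Here is a plain-Scala sketch of the same idea; the sample codes are hypothetical, with `OTHER` standing in for any of the event codes we drop. In Spark, `$"event_cd".isin(keptCodes.toSeq: _*)` would likely express the same filter more compactly.

```scala
val opposeCodes   = Set("OP.I", "OP.T", "OP.D", "OP.S") // opposition events
val baselineCodes = Set("PUBO", "R.PR")                 // publication and registration
val keptCodes     = opposeCodes ++ baselineCodes

// Hypothetical stream of event codes; "OTHER" represents a code outside the six.
val sampleCodes = Seq("PUBO", "OP.I", "OTHER", "R.PR", "OP.S")
val keptSample  = sampleCodes.filter(keptCodes.contains)
```

Keeping the two groups as separate sets also documents which codes are numerators (oppositions) and which are denominators (publications and registrations) in the later percentages.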

In [None]:
filtered_events.count()

In [None]:
filtered_events.show()

At this point, having so much data loaded seems to be hurting performance. Let's save the filtered events table and pick up from there when we re-run.

In [None]:
filtered_events.write.parquet("hdfs://sandbox.hortonworks.com:8020/tmp/oppose_events.parquet")

In [None]:
val oppose_events = sqlContext.read.parquet("hdfs://sandbox.hortonworks.com:8020/tmp/oppose_events.parquet")

**Second**, we want to make sure we can join the type to the event that took place based on serial number. So let's do that. As we recall, we created a column in the case_files table such that this would be easier to join together by serial number.

In [None]:
val case_files_types = case_files.select("serial_no", "mark_type")

In [None]:
val events_with_types = oppose_events.join(case_files_types, Seq("serial_no"), "left_outer")
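A left outer join keeps every event, even when its serial number has no match in the case files table; those events end up with a missing `mark_type`. The plain-Scala sketch below shows that behavior with a map lookup. The `Event` case class, serial numbers, and years are all invented for illustration.

```scala
case class Event(serialNo: Long, eventCd: String, eventYr: Int)

// Hypothetical rows; serial numbers and years are invented for illustration.
val typeBySerial = Map(70000001L -> "trademark", 70000002L -> "service")
val sampleEvents = Seq(
  Event(70000001L, "PUBO", 1999),
  Event(70000002L, "OP.I", 2000),
  Event(70000003L, "R.PR", 2001) // no matching case file
)

// Left outer join: every event is kept; the type is None when there is no match.
val eventsWithType = sampleEvents.map(e => (e, typeBySerial.get(e.serialNo)))
val unmatched = eventsWithType.count(_._2.isEmpty)
```

Any later filter on a specific `mark_type` value silently excludes the unmatched events, so counting them first is a cheap sanity check.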

In [None]:
events_with_types.show()

In [None]:
events_with_types.write.parquet("hdfs://sandbox.hortonworks.com:8020/tmp/oppose_events_types.parquet")

In [7]:
val oppose_events_with_types = sqlContext.read.parquet("hdfs://sandbox.hortonworks.com:8020/tmp/oppose_events_types.parquet")

In [8]:
oppose_events_with_types.count()

9683862

**Third**: Now that we have a neat, enriched subset of our data to play with for the final analysis, let's do some calculations to see when oppositions were relatively most common over time.

In [34]:
val event_pivot_trade = oppose_events_with_types.filter($"mark_type"==="trademark").groupBy($"event_yr").pivot("event_cd").count.orderBy($"event_yr" asc)

In [35]:
event_pivot_trade.count()

91

In [36]:
val event_pivot_serv = oppose_events_with_types.filter($"mark_type"==="service").groupBy($"event_yr").pivot("event_cd").count.orderBy($"event_yr" asc)

In [37]:
event_pivot_serv.count()

51
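The `groupBy(...).pivot(...).count` cells above produce one row per event year and one column per event code, so the row counts (91 and 51) are presumably the number of distinct years with at least one matching event. A plain-Scala sketch of the same reshaping on invented data:

```scala
// Hypothetical (event_yr, event_cd) pairs.
val yearCodePairs = Seq((1999, "PUBO"), (1999, "PUBO"), (1999, "OP.I"), (2000, "PUBO"))
val pivotCodes = Seq("OP.I", "PUBO")

// One row per year, one count per code; codes absent in a year are None (null in Spark).
val pivoted: Map[Int, Map[String, Option[Int]]] =
  yearCodePairs.groupBy(_._1).map { case (yr, rows) =>
    val counts = rows.groupBy(_._2).map { case (cd, rs) => cd -> rs.size }
    yr -> pivotCodes.map(cd => cd -> counts.get(cd)).toMap
  }
```

The `None` entries correspond to the nulls that have to be replaced with 0 before any arithmetic on the pivoted counts.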

After pivoting, we got a lot of `org.apache.spark.sql.AnalysisException: cannot resolve [colname] given input columns: ` errors, so let's rename the columns so we can proceed.

In [38]:
event_pivot_trade.printSchema()

root
 |-- event_yr: integer (nullable = true)
 |-- OP.D: long (nullable = true)
 |-- OP.I: long (nullable = true)
 |-- OP.S: long (nullable = true)
 |-- OP.T: long (nullable = true)
 |-- PUBO: long (nullable = true)
 |-- R.PR: long (nullable = true)



In [39]:
event_pivot_serv.printSchema()

root
 |-- event_yr: integer (nullable = true)
 |-- OP.D: long (nullable = true)
 |-- OP.I: long (nullable = true)
 |-- OP.S: long (nullable = true)
 |-- OP.T: long (nullable = true)
 |-- PUBO: long (nullable = true)
 |-- R.PR: long (nullable = true)



In [40]:
val newcols = Seq("event_year", "opd_count", "opi_count", "ops_count", "opt_count", "pubo_count", "rpr_count")

In [41]:
val trade_eventsRenamed = event_pivot_trade.toDF(newcols: _*)

In [42]:
val serv_eventsRenamed = event_pivot_serv.toDF(newcols: _*)
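A likely cause of the resolution errors is the dot in names such as `OP.D`: Spark reads a dot in a column reference as struct field access unless the name is wrapped in backticks. The `toDF(newcols: _*)` rename above works by position, so it relies on the pivot emitting the codes in alphabetical order. Below is a sketch of deriving the new names from the codes themselves. `countColumn` is a hypothetical helper, not part of the original notebook.

```scala
import java.util.Locale

// Derive a dot-free name from an event code: "OP.D" -> "opd_count".
def countColumn(eventCode: String): String =
  eventCode.replace(".", "").toLowerCase(Locale.ROOT) + "_count"

// Pivoted column names as shown by printSchema above.
val pivotedNames = Seq("OP.D", "OP.I", "OP.S", "OP.T", "PUBO", "R.PR")
val renamedNames = pivotedNames.map(countColumn)
```

The derived names match the hand-written list in `newcols` (apart from `event_year`), which is a quick way to confirm the positional rename lines up with the schema.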

Now we want to convert all nulls to 0 so we can do some math.

In [43]:
// Fill nulls in every count column with a single na.fill call
val events_pivot_trade_no_nulls = trade_eventsRenamed.na.fill(0, Seq("opd_count", "opi_count", "ops_count",
                                                                     "opt_count", "pubo_count", "rpr_count"))

In [44]:
// Fill nulls in every count column with a single na.fill call
val events_pivot_serv_no_nulls = serv_eventsRenamed.na.fill(0, Seq("opd_count", "opi_count", "ops_count",
                                                                   "opt_count", "pubo_count", "rpr_count"))

In [45]:
events_pivot_trade_no_nulls.show()

+----------+---------+---------+---------+---------+----------+---------+
|event_year|opd_count|opi_count|ops_count|opt_count|pubo_count|rpr_count|
+----------+---------+---------+---------+---------+----------+---------+
|      1901|        0|        0|        0|        0|         0|        2|
|      1903|        0|        0|        0|        0|         0|        1|
|      1904|        0|        0|        0|        0|         0|        2|
|      1905|        0|        0|        0|        0|         0|        3|
|      1906|        0|        0|        0|        0|         0|        1|
|      1908|        0|        0|        0|        0|         0|        1|
|      1911|        0|        0|        0|        0|         0|        1|
|      1914|        0|        0|        0|        0|         0|        1|
|      1916|        0|        0|        0|        0|         0|        2|
|      1917|        0|        

In [46]:
events_pivot_serv_no_nulls.show()

+----------+---------+---------+---------+---------+----------+---------+
|event_year|opd_count|opi_count|ops_count|opt_count|pubo_count|rpr_count|
+----------+---------+---------+---------+---------+----------+---------+
|      1962|        1|        0|        0|        0|         0|        0|
|      1964|        0|        0|        0|        1|         0|        0|
|      1968|        0|        0|        0|        1|         0|        1|
|      1969|        0|        0|        0|        0|         0|        1|
|      1970|        0|        0|        0|        0|         0|        1|
|      1971|        0|        0|        0|        0|         0|        1|
|      1973|        0|        0|        0|        0|         1|        2|
|      1974|        0|        0|        0|        0|         1|        3|
|      1975|        1|        0|        0|        0|         0|        2|
|      1976|        3|        

Let's make a user-defined function that gives us one column as a percentage of another.

In [47]:
val percUDF = udf((i:Integer, z:Integer) => ((i.toFloat/z.toFloat)*100))

In [48]:
// =!= is the Spark 2.x inequality operator; !== is deprecated because of its operator precedence
val trade_events_calculated = events_pivot_trade_no_nulls.withColumn("percent_opi", when($"pubo_count" =!= 0, percUDF($"opi_count", $"pubo_count")).otherwise(0)).
                                              withColumn("percent_ops", when($"pubo_count" =!= 0, percUDF($"ops_count", $"pubo_count")).otherwise(0)).
                                              withColumn("percent_opt", when($"pubo_count" =!= 0, percUDF($"opt_count", $"pubo_count")).otherwise(0)).
                                              withColumn("percent_opd", when($"pubo_count" =!= 0, percUDF($"opd_count", $"pubo_count")).otherwise(0))

In [49]:
// =!= is the Spark 2.x inequality operator; !== is deprecated because of its operator precedence
val serv_events_calculated = events_pivot_serv_no_nulls.withColumn("percent_opi", when($"pubo_count" =!= 0, percUDF($"opi_count", $"pubo_count")).otherwise(0)).
                                              withColumn("percent_ops", when($"pubo_count" =!= 0, percUDF($"ops_count", $"pubo_count")).otherwise(0)).
                                              withColumn("percent_opt", when($"pubo_count" =!= 0, percUDF($"opt_count", $"pubo_count")).otherwise(0)).
                                              withColumn("percent_opd", when($"pubo_count" =!= 0, percUDF($"opd_count", $"pubo_count")).otherwise(0))
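The guarded percentage used in these cells boils down to the small function below, written in plain Scala with `Long` inputs to match the `long` count columns. `percentOf` is a hypothetical name. In Spark the same result could probably be had without a UDF, using column arithmetic such as `$"opi_count" * 100.0 / $"pubo_count"` inside the same `when` guard.

```scala
// Percentage of `part` in `whole`; 0.0 when the denominator is zero,
// matching the when(pubo_count is non-zero, ...).otherwise(0) guard above.
def percentOf(part: Long, whole: Long): Double =
  if (whole == 0L) 0.0 else part * 100.0 / whole

val quarterOfTwoHundred = percentOf(25L, 200L) // 12.5
val noPublications      = percentOf(1L, 0L)    // 0.0
```

Reporting 0 for years with no publications keeps the output numeric, but it also hides the difference between "no oppositions" and "nothing published", so the early rows of the tables should be read with that in mind.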

In [50]:
trade_events_calculated.show(100)

+----------+---------+---------+---------+---------+----------+---------+-----------+-----------+-----------+-----------+
|event_year|opd_count|opi_count|ops_count|opt_count|pubo_count|rpr_count|percent_opi|percent_ops|percent_opt|percent_opd|
+----------+---------+---------+---------+---------+----------+---------+-----------+-----------+-----------+-----------+
|      1901|        0|        0|        0|        0|         0|        2|        0.0|        0.0|        0.0|        0.0|
|      1903|        0|        0|        0|        0|         0|        1|        0.0|        0.0|        0.0|        0.0|
|      1904|        0|        0|        0|        0|         0|        2|        0.0|        0.0|        0.0|        0.0|
|      1905|        0|        0|        0|        0|         0|        3|        0.0|        0.0|        0.0|        0.0|
|      1906|        0|        0|        0|        0|         0| 

In [15]:
serv_events_calculated.show(100)

+----------+---------+---------+---------+---------+----------+---------+-----------+-----------+-----------+-----------+
|event_year|opd_count|opi_count|ops_count|opt_count|pubo_count|rpr_count|percent_opi|percent_ops|percent_opt|percent_opd|
+----------+---------+---------+---------+---------+----------+---------+-----------+-----------+-----------+-----------+
|      1962|        1|        0|        0|        0|         0|        0|        0.0|        0.0|        0.0|        0.0|
|      1964|        0|        0|        0|        1|         0|        0|        0.0|        0.0|        0.0|        0.0|
|      1968|        0|        0|        0|        1|         0|        1|        0.0|        0.0|        0.0|        0.0|
|      1969|        0|        0|        0|        0|         0|        1|        0.0|        0.0|        0.0|        0.0|
|      1970|        0|        0|        0|        0|         0| 

Let's save these as output for visualization.

In [21]:
trade_events_calculated.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("hdfs://sandbox.hortonworks.com:8020/tmp/output/trade_opposed_percents.csv")

In [22]:
serv_events_calculated.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("hdfs://sandbox.hortonworks.com:8020/tmp/output/serv_opposed_percents.csv")

### Conclusion

It's an interesting dataset, but it has some inconsistencies within the data (which the authors of the dataset acknowledge). We explored the data and tried to identify trends in the types of trademarks and in opposition to trademarks. We found that service marks continue to rise, possibly because more and more internet-based services are protecting their intellectual property.

We had a hypothesis that oppositions to trademarks would rise as more and more trademarks are registered, but this did not necessarily seem to be the case for either trademarks or service marks.

Otherwise, this was an interesting dataset to work with in Spark. It is fairly large and requires a bit of processing to get to the answers we were seeking. As such, it was a good learning experience, and with more time and a better understanding of the trademark registration process, more could be extracted from this dataset.