# CSCI 4253 / 5253 - Lab #4 - Patent Problem with Spark RDD - SOLUTION
<div>
 <h2> CSCI 4253 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

This [Spark cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf) is a useful reference.

In [1]:
from pyspark import SparkContext, SparkConf
import numpy as np
import operator

In [2]:
conf=SparkConf().setAppName("Lab4-rdd").setMaster("local[*]")
sc = SparkContext(conf=conf)

In [3]:
import urllib.request
import os

# URLs from your makefile
urls = [
    'https://github.com/cu-csci-4253-datacenter/lab4-pyspark-patent-data/raw/master/cite75_99.txt.gz',
    'https://github.com/cu-csci-4253-datacenter/lab4-pyspark-patent-data/raw/master/apat63_99.txt.gz'
]

# Corresponding local filenames
filenames = [
    'cite75_99.txt.gz',
    'apat63_99.txt.gz'
]

# Download each file unless it already exists locally
for url, filename in zip(urls, filenames):
    if not os.path.exists(filename):
        print(f'Downloading {filename}...')
        urllib.request.urlretrieve(url, filename)
        print(f'{filename} downloaded successfully!')
        print(f'File size: {os.path.getsize(filename) / (1024*1024):.2f} MB')
    else:
        print(f'{filename} already exists, skipping download.')

print('All files ready!')

cite75_99.txt.gz already exists, skipping download.
apat63_99.txt.gz already exists, skipping download.
All files ready!


Using PySpark and RDDs on the https://coding.csel.io machines is slow: most of the code is executed in Python, which is much less efficient than the Java-based code behind PySpark DataFrames. Be patient, and try using `.cache()` to cache the output of joins. You may want to start with a reduced set of data before running the full task. You can use the `sample()` method to extract just a sample of the data or use 

These two RDDs are called "rawCitations" and "rawPatents" (`rddCitations` and `rddPatents` in the code below) because you probably want to process them further (e.g., convert them to integer types).

The `textFile` function returns each line as a string. This should work fine for this lab.

Other methods you use might return data of type `bytes`. If you haven't used Python `bytes` objects before, look them up in the Python documentation. You can convert a `bytes` value `x` into a UTF-8 string using `x.decode('utf-8')`. Alternatively, you can use the `open` method of the gzip library to read in all the lines as UTF-8 strings like this:
```
import gzip
with gzip.open('cite75_99.txt.gz', 'rt',encoding='utf-8') as f:
    rddCitations = sc.parallelize( f.readlines() )
```
This is less efficient than using `textFile`, because `textFile` uses the underlying HDFS or other file system to read the file across all the worker nodes, while `gzip.open()...readlines()` reads all the data in the frontend and then distributes it to the worker nodes.
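
To make the `bytes` versus `str` distinction concrete, here is a minimal sketch that uses only the standard library. It writes a tiny gzip file (the name `tiny_cite.txt.gz` and its two lines are made up for illustration), reads it back in binary mode, and decodes each line by hand.

```python
import gzip
import os
import tempfile

# Hypothetical two-line sample in the same format as cite75_99.txt.gz.
sample = '"CITING","CITED"\n3858241,956203\n'

path = os.path.join(tempfile.mkdtemp(), "tiny_cite.txt.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(sample)

# Binary mode ('rb') yields bytes objects, one per line.
with gzip.open(path, "rb") as f:
    raw_lines = f.readlines()

# decode() turns each bytes value into a str; rstrip() drops the line ending.
decoded = [line.decode("utf-8").rstrip("\r\n") for line in raw_lines]
print(type(raw_lines[0]).__name__, decoded)
```

Opening the file with mode `'rt'` and an `encoding`, as in the snippet above, performs the same decoding automatically.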

In [4]:
rddCitations = sc.textFile("cite75_99.txt.gz")
rddPatents = sc.textFile("apat63_99.txt.gz")

The data looks like the following.

In [5]:
rddCitations.take(5)

['"CITING","CITED"',
 '3858241,956203',
 '3858241,1324234',
 '3858241,3398406',
 '3858241,3557384']

In [6]:
rddPatents.take(5)

['"PATENT","GYEAR","GDATE","APPYEAR","COUNTRY","POSTATE","ASSIGNEE","ASSCODE","CLAIMS","NCLASS","CAT","SUBCAT","CMADE","CRECEIVE","RATIOCIT","GENERAL","ORIGINAL","FWDAPLAG","BCKGTLAG","SELFCTUB","SELFCTLB","SECDUPBD","SECDLWBD"',
 '3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,',
 '3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,',
 '3070803,1963,1096,,"US","IL",,1,,2,6,63,,9,,0.3704,,,,,,,',
 '3070804,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,,,,,']

In [7]:
citations_parsed = rddCitations.map(lambda line: line.split(",")) \
                               .map(lambda x: (x[0].strip('"'), x[1].strip('"')))

In [8]:
# Parsing patents data into (patent_id, patent_data_list)
patents_parsed = rddPatents.map(lambda line: line.split(",")) \
                           .map(lambda x: (x[0].strip('"'), x[1:]))

In [9]:
# Extracting (patent_id, state) pairs and dropping patents with no state
patent_states = patents_parsed.map(lambda x: (x[0], x[1][4].strip('"') if len(x[1]) > 4 and x[1][4].strip('"') != '' else None)) \
                              .filter(lambda x: x[1] is not None and x[1] != 'null')
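
The index `x[1][4]` is easy to misread: `x[1]` holds the fields after the patent number, so every position shifts down by one relative to the header. A small pure-Python check, using the header and one data row copied from the `take(5)` output above, shows which column each index selects.

```python
header = ('"PATENT","GYEAR","GDATE","APPYEAR","COUNTRY","POSTATE",'
          '"ASSIGNEE","ASSCODE","CLAIMS","NCLASS","CAT","SUBCAT","CMADE",'
          '"CRECEIVE","RATIOCIT","GENERAL","ORIGINAL","FWDAPLAG","BCKGTLAG",'
          '"SELFCTUB","SELFCTLB","SECDUPBD","SECDLWBD"')
row = '3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,'

names = [name.strip('"') for name in header.split(",")]
fields = row.split(",")

# Mirror the (patent_id, patent_data_list) shape used in patents_parsed.
key, data = fields[0], fields[1:]

# Positions in `data` are offset by one from positions in the full row.
print(names[1:][3], data[3].strip('"'))  # COUNTRY column
print(names[1:][4], data[4].strip('"'))  # POSTATE column
```

So within the value list, index 3 is `COUNTRY` and index 4 is `POSTATE`, which matches how the later cells use them.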

In [10]:
# Joining citations with citing patent states
citations_with_citing_state = citations_parsed.join(patent_states)

In [11]:
# Rearranging for next join
citations_rearranged = citations_with_citing_state.map(lambda x: (x[1][0], (x[0], x[1][1])))

In [12]:
# Creating a join with cited patent states
citations_with_both_states = citations_rearranged.join(patent_states)

In [13]:
# Keeping only citations where the citing and cited patents share a state
same_state_cites = citations_with_both_states.filter(lambda x: x[1][0][1] == x[1][1])

In [14]:
# Counting same state citations per patent
from operator import add
same_state_counts = same_state_cites.map(lambda x: (x[1][0][0], 1)).reduceByKey(add)
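
The nested indices in the last few cells follow from how `join` pairs values: joining `(k, v)` with `(k, w)` yields `(k, (v, w))`. The sketch below traces those shapes in plain Python. The `join` helper is a hypothetical stand-in that mimics the inner-join behaviour of `RDD.join` on small lists; it is not part of PySpark, and the patent numbers and states are invented.

```python
from collections import Counter, defaultdict

def join(left, right):
    """Toy inner join on the first element of each pair, like RDD.join."""
    index = defaultdict(list)
    for k, w in right:
        index[k].append(w)
    return [(k, (v, w)) for k, v in left for w in index.get(k, [])]

# Invented sample: (citing, cited) pairs and (patent, state) pairs.
citations = [("10", "1"), ("10", "2"), ("11", "1"), ("12", "3")]
states = [("1", "CA"), ("2", "TX"), ("10", "CA"), ("11", "CA")]

# (citing, (cited, citing_state))
with_citing = join(citations, states)
# Re-key by cited patent: (cited, (citing, citing_state))
rearranged = [(cited, (citing, st)) for citing, (cited, st) in with_citing]
# (cited, ((citing, citing_state), cited_state))
with_both = join(rearranged, states)

# x[1][0][1] is the citing state, x[1][1] is the cited state.
same_state = [x for x in with_both if x[1][0][1] == x[1][1]]
# Count per citing patent, as the reduceByKey(add) step does.
counts = Counter(x[1][0][0] for x in same_state)
print(dict(counts))
```

Patent `"12"` has no known state in this toy data, so the inner join drops its citation, just as citations involving patents without a state are dropped in the real pipeline.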

In [15]:
# Left Join with all the patents
patents_with_counts = patents_parsed.leftOuterJoin(same_state_counts)

In [16]:
# Keeping only US patents
us_patents = patents_with_counts.filter(lambda x: len(x[1][0]) > 3 and x[1][0][3].strip('"') == 'US')

In [17]:
# Formatting and sorting results
us_patents_formatted = us_patents.map(lambda x: (
    x[0],  # patent_id
    x[1][0],  # patent_data 
    x[1][1] if x[1][1] is not None else 0  # same_state_count
)).sortBy(lambda x: (-x[2], x[0]))
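
The sort key `(-x[2], x[0])` orders rows by count descending and breaks ties by patent number ascending. Because the patent number is still a string at this point, ties are broken lexicographically rather than numerically. A small sketch with invented rows, using Python's built-in `sorted` in place of `sortBy`, shows the difference.

```python
# Invented (patent_id, patent_data, same_state_count) rows.
rows = [
    ("900", [], 3),
    ("1000", [], 3),
    ("500", [], 7),
    ("42", [], 0),
]

# Same key as the sortBy call: string ids compare character by character.
ordered = sorted(rows, key=lambda x: (-x[2], x[0]))
print([r[0] for r in ordered])

# Converting the id to int makes the tie-break numeric instead.
numeric = sorted(rows, key=lambda x: (-x[2], int(x[0])))
print([r[0] for r in numeric])
```

When all ids have the same number of digits, as in the patent rows shown earlier, the two orderings agree. If only the first few rows are needed, `takeOrdered(10, key=...)` on the RDD may be worth trying as an alternative to a full `sortBy` followed by `take`.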

In [18]:
# Fetching top 10 results
top_10 = us_patents_formatted.take(10)

In [19]:
# Printing the results with fixed column widths for alignment
print("TOP 10 US PATENTS BY SAME-STATE CITATIONS")
print("=" * 200)

# Defining exact column widths for perfect alignment
col_widths = [8, 6, 8, 8, 8, 8, 12, 8, 7, 7, 5, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 11]
header = ['PATENT', 'GYEAR', 'GDATE', 'APPYEAR', 'COUNTRY', 'POSTATE', 'ASSIGNEE', 'ASSCODE', 
          'CLAIMS', 'NCLASS', 'CAT', 'SUBCAT', 'CMADE', 'CRECEIVE', 'RATIOCIT', 'GENERAL', 
          'ORIGINAL', 'FWDAPLAG', 'BCKGTLAG', 'SELFCTUB', 'SELFCTLB', 'SECDUPBD', 'SECDLWBD', 'SAME_STATE']

# Printing header with exact spacing
header_line = ""
for i, col in enumerate(header):
    header_line += f"{col:<{col_widths[i]}}"
print(header_line)

# Printing separator
print("-" * sum(col_widths))

# Printing data rows with exact spacing
for patent_id, patent_data, count in top_10:
    # Cleaning and preparing data
    clean_data = [patent_id]  # Start with patent ID
    
    # Adding patent fields (first 22 fields)
    for i, field in enumerate(patent_data[:22]):
        field = field.strip('"')
        if field == '' or field is None:
            clean_data.append('null')
        else:
            clean_data.append(field)
    
    # Adding same-state count
    clean_data.append(str(count))
    
    # Formatting row with exact column widths
    row_line = ""
    for i, value in enumerate(clean_data):
        if i < len(col_widths):
            if len(str(value)) > col_widths[i] - 1:
                value = str(value)[:col_widths[i] - 2] + ".."
            row_line += f"{str(value):<{col_widths[i]}}"
    
    print(row_line)

TOP 10 US PATENTS BY SAME-STATE CITATIONS
PATENT  GYEAR GDATE   APPYEAR COUNTRY POSTATE ASSIGNEE    ASSCODE CLAIMS NCLASS CAT  SUBCAT CMADE  CRECEIVE RATIOCIT GENERAL  ORIGINAL FWDAPLAG BCKGTLAG SELFCTUB SELFCTLB SECDUPBD SECDLWBD SAME_STATE 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
5959466 1999  14515   1997    US      CA      5310        2       null   326    4    46     159    0        1        null     0.6186   null     4.8868   0.0455   0.044    null     null     125        
5983822 1999  14564   1998    US      TX      569900      2       null   114    5    55     200    0        0.995    null     0.7201   null     12.45    0        0        null     null     103        
6008204 1999  14606   1998    US      CA      749584      2       null   514    3    31     121    0        1        null     0.7415   null     5        0

In other words, each record is a single string of comma-separated values. You will need to convert these to (K,V) pairs, probably converting the keys to `int` and so on. You'll also need to `filter` out the header string, since there's no easy way to extract all the lines except the first.
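
As a minimal sketch of that conversion, the helpers below drop the header and turn each citation line into an `(int, int)` pair. The function names `is_header` and `parse_citation` are hypothetical, and the sample lines are copied from the `take(5)` output earlier in the notebook. The sketch assumes the header is the only line that begins with a double quote, which holds for the sample shown. It runs on a plain list so it can be checked without Spark.

```python
def is_header(line):
    """Assumption: only the header line starts with a double quote."""
    return line.startswith('"')

def parse_citation(line):
    """Turn 'citing,cited' into a pair of ints."""
    citing, cited = line.split(",")
    return int(citing), int(cited)

lines = ['"CITING","CITED"', '3858241,956203', '3858241,1324234']
pairs = [parse_citation(l) for l in lines if not is_header(l)]
print(pairs)

# With Spark (not run here), the same functions could be applied as:
# citations = rddCitations.filter(lambda l: not is_header(l)).map(parse_citation)
```

The solution cells above keep the keys as strings instead; that also works, because the header row never matches a real patent number in the joins.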