Can we get rid of any of this data? #15

Open
dccabs opened this issue Feb 21, 2017 · 5 comments

@dccabs
Owner

dccabs commented Feb 21, 2017

** updated with some 2017 info **

Here's an example of our data set from 2017. All years look the same; I just chose this one because it's small. If you see something we don't need, let me know. Give me the key and the value.

For example, say "we don't need applicationNumberText.value" and I'll know what you mean.

{
		"applicationDataOrProsecutionHistoryDataOrPatentTermData": [{
			"applicationNumberText": {
				"value": "14434661",
				"electronicText": "14434661"
			},
			"filingDate": "2017-01-18",
			"applicationTypeCategory": "UTILITY",
			"partyBag": {
				"applicantBagOrInventorBagOrOwnerBag": [{
					"partyIdentifierOrContact": [{
						"value": "1009"
					}]
				}, {
					"inventorOrDeceasedInventor": [{
						"contactOrPublicationContact": [{
							"name": {
								"personNameOrOrganizationNameOrEntityName": [{
									"personStructuredName": {
										"firstName": "Jinming",
										"lastName": "Cui"
									}
								}]
							},
							"cityName": "Guangzhou City, Guangdong",
							"countryCode": "CN"
						}],
						"sequenceNumber": "1"
					}, {
						"contactOrPublicationContact": [{
							"name": {
								"personNameOrOrganizationNameOrEntityName": [{
									"personStructuredName": {
										"firstName": "Shijie",
										"lastName": "Zeng"
									}
								}]
							},
							"cityName": "Guangzhou City, Guangdong",
							"countryCode": "CN"
						}],
						"sequenceNumber": "2"
					}, {
						"contactOrPublicationContact": [{
							"name": {
								"personNameOrOrganizationNameOrEntityName": [{
									"personStructuredName": {
										"firstName": "Olaf",
										"lastName": "Eichstaedt"
									}
								}]
							},
							"cityName": "Guangzhou City, Guangdong",
							"countryCode": "CN"
						}],
						"sequenceNumber": "3"
					}, {
						"contactOrPublicationContact": [{
							"name": {
								"personNameOrOrganizationNameOrEntityName": [{
									"personStructuredName": {
										"firstName": "Jiandong",
										"lastName": "Huang"
									}
								}]
							},
							"cityName": "Guangzhou City, Guangdong",
							"countryCode": "CN"
						}],
						"sequenceNumber": "4"
					}, {
						"contactOrPublicationContact": [{
							"name": {
								"personNameOrOrganizationNameOrEntityName": [{
									"personStructuredName": {
										"firstName": "Ruxu",
										"lastName": "Du"
									}
								}]
							},
							"cityName": "Guangzhou City, Guangdong",
							"countryCode": "CN"
						}],
						"sequenceNumber": "5"
					}]
				}, {
					"primaryExaminerOrAssistantExaminerOrAuthorizedOfficer": [{
						"name": {
							"personNameOrOrganizationNameOrEntityName": [{
								"personStructuredName": {
									"lastName": "-"
								}
							}]
						}
					}]
				}]
			},
			"groupArtUnitNumber": {
				"value": "1799",
				"electronicText": "1799"
			},
			"applicationConfirmationNumber": "1320",
			"applicantFileReference": "1971-006",
			"patentClassificationBag": {
				"cpcClassificationBagOrIPCClassificationOrECLAClassificationBag": [{
					"ipcrClassification": [{
						"patentClassificationText": "435"
					}, {
						"patentClassificationText": "288.700"
					}]
				}]
			},
			"businessEntityStatusCategory": "SMALL",
			"firstInventorToFileIndicator": true,
			"inventionTitle": {
				"content": ["Device for Cell Culturing and Processing"]
			},
			"applicationStatusCategory": "Application Dispatched from Preexam, Not Yet Docketed",
			"applicationStatusDate": "2017-02-02",
			"officialFileLocationCategory": "ELECTRONIC",
			"patentPublicationIdentification": {
				"publicationNumber": "0"
			},
			"patentGrantIdentification": {
				"patentNumber": "0"
			}
		}, null, {
			"applicationPublication": {
				"patentPublicationIdentification": {
					"publicationNumber": " 0  ",
					"publicationDate": "0001-01-01"
				},
				"webURI": "http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1=0 .PGNR.&OS=DN/0 &RS=DN/0 "
			},
			"grantPublication": {
				"patentGrantIdentification": {
					"patentNumber": "0"
				},
				"webURI": " "
			}
		}],
		"st96Version": "V2_0",
		"ipoVersion": "US_V6_0"
	}
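
For anyone following along, removing a field named in the dotted form above (for example applicationNumberText.value) could look roughly like this. It is only a sketch, not the batch script actually being run: the function names `drop_key` and `_delete_anchored` are made up for illustration, and it assumes each file holds one JSON document shaped like the record above.

```python
def _delete_anchored(node, keys):
    """Delete the key chain `keys` starting exactly at `node`, if it is present."""
    if isinstance(node, list):
        for item in node:
            _delete_anchored(item, keys)
        return
    if not isinstance(node, dict) or keys[0] not in node:
        return
    if len(keys) == 1:
        del node[keys[0]]
    else:
        _delete_anchored(node[keys[0]], keys[1:])


def drop_key(node, path):
    """Remove a dotted key path wherever it occurs in nested dicts and lists.

    Hypothetical helper: "applicationNumberText.value" removes the "value"
    key under every "applicationNumberText" object and leaves the rest alone.
    """
    keys = path.split(".")
    if isinstance(node, dict):
        _delete_anchored(node, keys)
        for value in list(node.values()):
            drop_key(value, path)
    elif isinstance(node, list):
        # Non-container items, such as the null entry in the sample, are skipped.
        for item in node:
            drop_key(item, path)


record = {
    "applicationDataOrProsecutionHistoryDataOrPatentTermData": [
        {
            "applicationNumberText": {"value": "14434661", "electronicText": "14434661"},
            "groupArtUnitNumber": {"value": "1799", "electronicText": "1799"},
        },
        None,
    ]
}
drop_key(record, "applicationNumberText.value")
```

Because the path is matched as a chain of keys, dropping applicationNumberText.value leaves groupArtUnitNumber.value untouched even though both end in "value".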

@absoluke

Hmmm... this is a great question. I'm going to tag Chris on this, too. I'm wondering whether there is any information that we wouldn't want in this API, just in case. I'm adding Chris and will look closer and think about it further. This is like a foreign language to me; I have no programming background except Fortran in my first year of college.

@dccabs
Owner Author

dccabs commented Feb 22, 2017

Hey Luke, I removed most of the data; now it's just a single patent record.

This is just the format that the data is stored in. If you go to the PAIR bulk data website and search by application number for "14434661", you will see the visual representation of this same data on their website.

Totally not a big deal if we want to keep it all. I'm just running some batch scripts to edit all these files at once and figured that if we didn't need any of this shit, I'd remove it.

@absoluke

There are some fields that are less needed than others, and many important fields are blank in this example because it is such a newly filed application that it hasn't published yet in the US. All in all, though, we may need all of these fields at some point, depending on the info we need to serve to website requests or to clients through alerts. Now, based on what I am seeing, I think something I feared is going on with this data. If this is all the data that can be gotten for any record, we are limited to certain functionality, like serving up "Status" and "Status Date", which is really important and is the first feature we'd like to provide in bulk and/or with the "soft wall". The data I'm seeing in the code above is the basic application data for this record. Go to the traditional Public PAIR interface at http://portal.uspto.gov/pair/PublicPair, enter the captcha, enter that application number, and see what I mean. The data in the code above represents the "Application Data" tab. Notice that there are several other tabs of data, some with metadata and some with images, namely the following tabs:

Transaction History
Image File Wrapper
Continuity Data
Foreign Priority
Address & Attorney Agent
Assignments; and
Display References

I am almost certain that to provide a full-fledged competing PAIR monitoring service, we will eventually have to bulk up the API to use Reed Tech's scraped data and image data from these other tabs. By way of an important example, many customers will want to know when, and only when, a specific "type" of event happens in the "Transaction History" or "Prosecution History" tabs, because these events trigger certain things they need to look up or tell their clients. The best way to see this is to click on the "Image File Wrapper" tab and look at all the document images that are available for download. To the left of each one is a "Document Code", and somewhere in my past I've seen a document that identifies what each Document Code means. So our competitors, like Cardinal (our former employer) and Reed Tech, scrape the images/PDFs and document codes daily and send email alerts to customers when a specific code occurs, and they also include a link to, or an attached PDF of, the document. So the paying customer gets an alert the day after something happens and simply clicks and reviews the document. It is intelligence on demand without having to go into the main Public PAIR website.

So that leads me to one simple question: are you seeing any OTHER data in the JSON or XML or otherwise that is indicative of at least the text/metadata present on any of those other tabs? If not, right now the max capability of our tool (albeit still very useful) is to give the current status for a big list of applications all at once, along with the status date and title of invention or any of the other fields in that JSON above. However, to eventually offer a fully competing alert service, bury the competition, and pay for our vacation homes in the Caribbean, we will ultimately need to API out all of the PAIR tabs and images and provide full-featured alerts at a better price and with a better, more user-friendly interface. Just wanted to make sure you are on the same page. However, if there is more data in these JSONs besides just the "Application Data" tab data, we can probably provide some great features even without fucking with the images or the document codes at this time. Hell, if we have the document codes somewhere in there, we can tell customers exactly what's happening when something hits an important code state, just without the images, which they can go and download themselves. Also, companies take the image data and sell full file histories at several bucks per application. Make sense?
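
To make the "max capability" concrete, here is a rough sketch of pulling just the status, status date, and title out of one record. The field names come straight from the JSON sample above; the function name `summarize` and the shape of the returned dict are assumptions made only for illustration.

```python
def summarize(record):
    """Return status fields for the first application entry in a record, or None.

    Hypothetical helper: assumes the record has the same shape as the
    2017 sample in this thread, including possible null entries in the array.
    """
    entries = record.get("applicationDataOrProsecutionHistoryDataOrPatentTermData", [])
    for entry in entries:
        # Skip null entries and the publication/grant block, which has no application number.
        if isinstance(entry, dict) and "applicationNumberText" in entry:
            title = entry.get("inventionTitle", {}).get("content", [])
            return {
                "applicationNumber": entry["applicationNumberText"].get("value"),
                "status": entry.get("applicationStatusCategory"),
                "statusDate": entry.get("applicationStatusDate"),
                "title": title[0] if title else None,
            }
    return None


sample = {
    "applicationDataOrProsecutionHistoryDataOrPatentTermData": [
        {
            "applicationNumberText": {"value": "14434661", "electronicText": "14434661"},
            "inventionTitle": {"content": ["Device for Cell Culturing and Processing"]},
            "applicationStatusCategory": "Application Dispatched from Preexam, Not Yet Docketed",
            "applicationStatusDate": "2017-02-02",
        },
        None,
    ]
}
summary = summarize(sample)
```

Anything from the other PAIR tabs (transaction history, document codes, images) is simply not reachable this way, because those fields are not in the sample.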

That will be the difference between them and us, at least at the outset. The funny thing is that, hundreds of thousands of times per day, web users go to Public PAIR just to retrieve "status", which is definitely in your JSON data sample above. No one that I am aware of is serving it for free other than the USPTO, and no one is letting you enter a full list and quickly grab statuses all at once. So if the answer to my question above is "shit, Luke, it's only the Application Data tab data", then we still have a valuable service, and the question is how to monetize just that with soft walls, verified email addresses, credit card accounts, etc. On a positive note, the data we may not have from this JSON has to be distributed by Reed Tech, and as we grow, we can use that data plus a better service and a better brand to fully bury the five competitors doing this in the industry.

Let me know if I have confused you and, more importantly, let me know if there are any data fields other than just these from the Application Data tab. Here's a screenshot of the tabs on the traditional Public PAIR website:
[Screenshot: tabs on the traditional Public PAIR website, 2017-02-21]

@clmulk
Collaborator

clmulk commented Feb 22, 2017 via email

@dccabs
Owner Author

dccabs commented Feb 22, 2017

To answer your question, Luke: I don't think so. I'm pretty sure that's all the data the PAIR public API is giving us.

Basically, if you can't do it on the PAIR bulk data site, you can't do it on ours. And vice versa: anything you can do on theirs you can do on ours, but with unlimited requests.

My thinking is along the lines of Chris's. We set out to build a service that is a batch request tool for several types of numbers and that gets statuses, right? Let's accomplish that and make it awesome. Then we'll start iterating on top of that. We'll do as much as we can with our API.

Getting into the business of scraping HTML pages isn't really my cup of tea. It takes way too much time to set up and to maintain. But let's cross that bridge when we get to it.

Right now I see the plan as this.

  1. Get the API mirror site up and running.
  2. Set up the mechanism for it to update on a daily basis.
  3. Build the batch request tool (already have a semi-working prototype for this part).
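
As a rough illustration of step 3 only, and not the actual prototype, a batch lookup against a local mirror might be sketched like this. Everything here is an assumption: the mirror is modeled as a plain dict keyed by application number, the names `normalize` and `batch_status` are invented, and only application numbers are handled, not the other number types.

```python
import re


def normalize(number):
    """Strip everything but digits, so "14/434,661" and " 14434661 " match the same key."""
    return re.sub(r"[^0-9]", "", number)


def batch_status(numbers, mirror):
    """Look up many application numbers at once.

    `mirror` is a hypothetical stand-in for the mirrored data: a dict mapping
    application number -> (status, status date). Returns the matches plus the
    raw inputs that could not be found, so the caller can report them.
    """
    found, missing = {}, []
    for raw in numbers:
        key = normalize(raw)
        if key in mirror:
            found[key] = mirror[key]
        else:
            missing.append(raw)
    return found, missing


mirror = {
    "14434661": ("Application Dispatched from Preexam, Not Yet Docketed", "2017-02-02"),
}
found, missing = batch_status(["14/434,661", "99999999"], mirror)
```

Returning the unmatched inputs separately keeps one bad number from failing the whole batch.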
