First release attempt (#18)
v0.1 pre-release
* Minimal code (#2)
* Setting up github... added requirements.txt to enable dependency tree
* First CI test (#3)
* Minimal set of files to get tests passing #1
* Config to trigger travis
* Remaining code (#7)
* Uses setup.py (#10)
* Corrected license
* bug: backend matplotlib so that it works with Pycharm. Fixes issue #12. (#13)
* feat: now shows the number of patents analysed for cpc classification
* feat: updated ReadME. Uploaded outputs for ReadME. Also moved fdg outputs to outputs/fdg folder not fdg folder in root directory (cleaner)
* Experimenting with code coverage #9 (#17)
IanGrimstead committed Aug 21, 2018
1 parent 378c8cd commit b8a511a
Showing 51 changed files with 3,233 additions and 4 deletions.
13 changes: 13 additions & 0 deletions .coveragerc
@@ -0,0 +1,13 @@
[run]
branch = True
source = scripts

[report]
exclude_lines =
    if self.debug:
    pragma: no cover
    raise NotImplementedError
    if __name__ == .__main__.:
ignore_errors = True
omit =
    tests/*
35 changes: 35 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,35 @@
---
name: Bug report
about: Create a report to help us improve

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
- OS: [e.g. iOS]
- Browser [e.g. chrome, safari]
- Version [e.g. 22]

**Smartphone (please complete the following information):**
- Device: [e.g. iPhone6]
- OS: [e.g. iOS8.1]
- Browser [e.g. stock browser, safari]
- Version [e.g. 22]

**Additional context**
Add any other context about the problem here.
17 changes: 17 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,17 @@
---
name: Feature request
about: Suggest an idea for this project

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.
3 changes: 3 additions & 0 deletions .gitignore
@@ -102,3 +102,6 @@ venv.bak/

# mypy
.mypy_cache/

# PyCharm
.idea
18 changes: 18 additions & 0 deletions .travis.yml
@@ -0,0 +1,18 @@
language: python
python:
- "3.6"

install:
# command to install dependencies
- python setup.py install
# also need to download punkt tokeniser data
- travis_wait 30 python -m nltk.downloader punkt

script:
# for codecov support
- pip install pytest pytest-cov
# command to run tests
- pytest --cov=./

after_success:
- bash <(curl -s https://codecov.io/bash)
8 changes: 8 additions & 0 deletions LICENSE
@@ -0,0 +1,8 @@
The Open Government Licence (OGL) Version 3

Copyright (c) 2018 Office of National Statistics

This source code is licensed under the Open Government Licence v3.0. To view this
licence, visit www.nationalarchives.gov.uk/doc/open-government-licence/version/3
or write to the Information Policy Team, The National Archives, Kew, Richmond,
Surrey, TW9 4DU.
220 changes: 218 additions & 2 deletions README.md
@@ -1,2 +1,218 @@
# patent_app_detect
Derives popular terminology included within a particular patent technology area (CPC classification), based on text analysis of patent abstract information
[![build status](http://img.shields.io/travis/datasciencecampus/patent_app_detect/master.svg?style=flat)](https://travis-ci.org/datasciencecampus/patent_app_detect)
[![codecov](https://codecov.io/gh/datasciencecampus/patent_app_detect/branch/master/graph/badge.svg)](https://codecov.io/gh/datasciencecampus/patent_app_detect)
[![LICENSE.](https://img.shields.io/badge/license-OGL--3-blue.svg?style=flat)](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)

# patent_app_detect

## Description of tool

The tool derives popular terminology within a particular patent technology area ([CPC classification](https://www.epo.org/searching-for-patents/helpful-resources/first-time-here/classification/cpc.html)), based on text analysis of patent abstracts. For example, if the tool is targeted at the [Y02 classification](https://www.epo.org/news-issues/issues/classification/classification.html), identified terms could include 'fuel cell' and 'heat exchanger'. Several output options are provided: a report, a word cloud, or a force directed graph. Some example outputs are shown below:

### Report

The score here is derived from the term [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) values using the Y02 classification on a random sample of 10,000 patents. The terms are all bigrams in this example.

|Term | TF-IDF Score |
| :------------------------ | -------------------:|
|1. fuel cell | 2.143778 |
|2. heat exchanger | 1.697166 |
|3. exhaust gas | 1.496812 |
|4. combustion engine | 1.480615 |
|5. combustion chamber | 1.390726 |
|6. energy storage | 1.302651 |
|7. internal combustion | 1.108040 |
|8. positive electrode | 1.100686 |
|9. carbon dioxide | 1.092638 |
|10. control unit | 1.069478 |

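As a rough illustration of how a ranked list like the one above can be produced, the sketch below scores bigrams by summing their tf-idf values across documents. It uses only the standard library and rests on assumptions: the toy abstracts, the helper names `bigrams` and `tfidf_scores`, and the exact idf formula are illustrative and are not taken from the tool's source.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the adjacent word pairs of a lower-cased text."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_scores(docs):
    """Sum tf-idf over all documents for every bigram (illustrative formula)."""
    term_counts = [Counter(bigrams(doc)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    scores = Counter()
    for counts in term_counts:
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[term]) + 1.0  # shifted so idf stays positive
            scores[term] += tf * idf
    return scores


# Toy abstracts, invented for this example.
abstracts = [
    "fuel cell stack with heat exchanger",
    "fuel cell membrane",
    "exhaust gas heat exchanger",
]
scores = tfidf_scores(abstracts)
for rank, (term, score) in enumerate(scores.most_common(3), start=1):
    print(f"{rank}. {term:<20} {score:.6f}")
```

The real tool may tokenise, normalise, and aggregate differently (see the `-p` option in the help output), so the numbers here are not comparable with the table.
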
### Word cloud

Here is a [wordcloud](https://github.com/datasciencecampus/patent_app_detect/output/wordclouds/wordcloud_tech.png) using the Y02 classification on a random sample of 10,000 patents. The greater the tf-idf score, the larger the font size of the term.

### Force directed graph

This output provides an [interactive graph](https://github.com/datasciencecampus/patent_app_detect/outputs/fdg/index.html) that shows connections between terms that are generally found in the same patent documents. This example was run for the Y02 classification on a random sample of 10,000 patents.

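The idea of connecting terms that appear in the same documents can be sketched as a co-occurrence count, where each pair of terms becomes a graph edge weighted by the number of documents they share. This is a minimal sketch under assumptions: the function name `cooccurrence_edges` and the toy term sets are hypothetical, and the tool's actual graph construction may differ.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(doc_terms):
    """Count how many documents each unordered pair of terms shares."""
    edges = Counter()
    for terms in doc_terms:
        # Sort so that (a, b) and (b, a) map to the same edge key.
        for pair in combinations(sorted(set(terms)), 2):
            edges[pair] += 1
    return edges


# One set of extracted terms per patent (invented for this example).
docs = [
    {"fuel cell", "heat exchanger"},
    {"fuel cell", "heat exchanger", "exhaust gas"},
    {"exhaust gas", "combustion engine"},
]
edges = cooccurrence_edges(docs)
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight}")
```

The resulting weighted edge list is the kind of input a force directed layout needs; the layout and rendering steps are outside this sketch.
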
## How to install

The tool has been developed to work on both Windows and macOS. To install:

1. Make sure Python 3.6 is installed and on your path.
It can be installed from [this location](https://www.python.org/downloads/release/python-360/) by selecting the *relevant installer for your operating system*. When prompted, check the box to set the paths and environment variables for you, and you should be ready to go. Python can also be installed as part of Anaconda [here](https://www.anaconda.com/download/#macos).

To check the default Python version on your system, run the following in a command line or terminal:

```
python --version
```

**_Note_**: If Python 2 is the default Python version, but you have installed Python 3.6, your path may be set up to use `python3` instead of `python`.

2. To install the packages and dependencies for the tool, from the root directory (patent_app_detect) run:
```
pip install -e .
```
This will install all the libraries and run some tests. If the tests pass, the app is ready to run. If any of the tests fail, please email thanasis.anthopoulos@ons.gov.uk or ian.grimstead@ons.gov.uk
with a screenshot of the failure and we will get back to you.

## How to use

The program is command line driven, and called in the following manner:

```
python detect.py
```

The above produces a default report output of top ranked terms, using default parameters. Additional command line arguments provide alternative options, for example a word cloud or force directed graph (fdg) output. The `report` option is the default, so running just `python detect.py` is equivalent to the first command below. The option `all` produces all three outputs:

```
python detect.py -o='report'
python detect.py -o='wordcloud'
python detect.py -o='fdg'
python detect.py -o='all'
```

### Choosing patent source

This selects the set of patents for use during analysis. The default source is a pre-created random 1,000 patent dataset from the USPTO, `USPTO-random-1000`. Pre-created datasets of 100, 1,000, 10,000, 100,000, and 500,000 patents are available in the `./data` folder. For example, the following runs the tool on a pre-created random dataset of 10,000 patents:

```
python detect.py -ps=USPTO-random-10000
```

### Choosing CPC classification

This subsets the chosen patent dataset to a particular Cooperative Patent Classification (CPC) class, for example Y02. The Y02 classification is for "technologies or applications for mitigation or adaptation against climate change". A larger patent dataset is generally required in this case, to allow for the reduction in patent numbers after subsetting. An example command is:

```
python detect.py -cpc=Y02 -ps=USPTO-random-10000
```

The console reports the number of patents in the subset. For example, for `python detect.py -cpc=Y02 -ps=USPTO-random-10000` the number of Y02 patents is 197, so the tf-idf is run on 197 patents.

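Conceptually, the subsetting step keeps only the patents that carry at least one CPC code in the requested class. The sketch below shows this with a prefix match. The record layout (`id`, `cpc_codes`), the sample codes, and the function name `subset_by_cpc` are assumptions made for illustration and do not describe the tool's actual data format.

```python
def subset_by_cpc(patents, cpc_prefix):
    """Keep patents with at least one CPC code starting with cpc_prefix."""
    return [
        patent for patent in patents
        if any(code.startswith(cpc_prefix) for code in patent["cpc_codes"])
    ]


# Hypothetical patent records, invented for this example.
patents = [
    {"id": "A", "cpc_codes": ["Y02E60/50", "H01M8/04"]},
    {"id": "B", "cpc_codes": ["G06F17/30"]},
    {"id": "C", "cpc_codes": ["Y02T10/12"]},
]
y02 = subset_by_cpc(patents, "Y02")
print(f"Number of Y02 patents: {len(y02)}")
```

Because only a small share of a random sample usually falls in any one class, the subset can be much smaller than the source dataset, which is why a larger source is recommended.
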

### Term n-gram limits

Terms identified may be unigrams, bigrams, or trigrams. The following arguments set the ngram limits for 2-3 word terms (which are the default values).
```
python detect.py -mn=2 -mx=3
```
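To make the effect of the two limits concrete, the following sketch generates every n-gram whose length lies between a minimum and a maximum. The function name `ngrams` and the sample tokens are illustrative assumptions; the tool's own tokenisation is not shown in this README.

```python
def ngrams(tokens, min_n=2, max_n=3):
    """Return all n-grams of tokens with min_n <= n <= max_n."""
    if not 1 <= min_n <= max_n:
        raise ValueError("need 1 <= min_n <= max_n")
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


tokens = "internal combustion engine control".split()
terms = ngrams(tokens, min_n=2, max_n=3)
print(terms)
```

With `min_n=2` and `max_n=3` the four tokens yield three bigrams and two trigrams, and no unigrams.
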

### Time limits
This restricts the patent cohort to only those from 2000 up to now.

```
python detect.py -yf=2000
```

This restricts the patent cohort to only those between 2000 and 2016.

```
python detect.py -yf=2000 -yt=2016
```
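The two year options can be read as a simple range filter, where a `year_to` of 0 stands for "now" (as the help text states). The sketch below assumes that both bounds are inclusive and that patents are compared by year only; both are assumptions, and the function name `in_cohort` is hypothetical.

```python
from datetime import date


def in_cohort(year, year_from, year_to=0):
    """True if year lies in [year_from, year_to]; year_to=0 means the current year."""
    last_year = date.today().year if year_to == 0 else year_to
    return year_from <= year <= last_year


years = [1998, 2000, 2010, 2016, 2017]
selected = [year for year in years if in_cohort(year, 2000, 2016)]
print(selected)
```
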
### Time weighting

This option applies a linear weight that starts from 0.01 and ends at 1 between the time limits.
```
python detect.py -t
```
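A linear weight of this kind can be written as a straight-line interpolation between the two time limits. The sketch below works at year granularity, which is an assumption, since the tool may weight by full publication date. The function name `time_weight` and its parameters are illustrative.

```python
def time_weight(year, year_from, year_to, w_min=0.01, w_max=1.0):
    """Linearly interpolate a weight from w_min at year_from to w_max at year_to."""
    if year_to == year_from:
        return w_max  # a single-year cohort gets full weight
    fraction = (year - year_from) / (year_to - year_from)
    return w_min + fraction * (w_max - w_min)


for year in (2000, 2008, 2016):
    print(year, round(time_weight(year, 2000, 2016), 3))
```

Under this scheme the oldest patents in the cohort contribute almost nothing to a term's score, and the most recent contribute fully.
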

### Citation weighting

This weights the term tf-idf scores by the number of citations each patent has. The weight is normalised to lie between 0 and 1, with higher values indicating more citations.

```
python detect.py -c
```
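The README does not say how the citation counts are normalised, so the sketch below uses min-max scaling as one plausible choice and then multiplies each patent's tf-idf value by its weight. The function name `citation_weights` and all sample values are assumptions for illustration.

```python
def citation_weights(citation_counts):
    """Min-max normalise citation counts to the range [0, 1]."""
    lowest, highest = min(citation_counts), max(citation_counts)
    if highest == lowest:
        # All patents are equally cited: give every patent the same weight.
        return [1.0 for _ in citation_counts]
    span = highest - lowest
    return [(count - lowest) / span for count in citation_counts]


citations = [0, 5, 10]   # citations per patent (invented)
tfidf = [0.8, 0.6, 0.4]  # tf-idf of one term in each patent (invented)
weights = citation_weights(citations)
weighted = [score * weight for score, weight in zip(tfidf, weights)]
print(weights)
print(weighted)
```

Note that with min-max scaling the least cited patent receives a weight of 0 and drops out entirely; dividing by the maximum count instead would avoid that only when every patent has at least one citation.
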

### Term focus

This option uses a second random patent dataset, by default `USPTO-random-10000`, whose terms are discounted from those of the chosen CPC classification, to 'focus' the identified terms away from terms found more generally in the patent dataset. To enable it:

```
python detect.py -f
```
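The README does not spell out how the discounting is computed. One simple reading, sketched below, is to drop any term that also ranks among the top terms of the background dataset. This is an assumption made purely for illustration: the function name `focus_terms`, the `top_n` cut-off, and the scores are all hypothetical, and the tool may instead subtract or rescale scores.

```python
def focus_terms(cpc_scores, background_scores, top_n=2):
    """Drop terms that rank in the top_n of the background dataset."""
    ranked = sorted(background_scores, key=background_scores.get, reverse=True)
    common = set(ranked[:top_n])
    return {term: score for term, score in cpc_scores.items() if term not in common}


# Invented scores for the CPC subset and for the general (background) dataset.
cpc_scores = {"fuel cell": 2.1, "heat exchanger": 1.7, "control unit": 1.0}
background_scores = {"control unit": 3.0, "computer program": 2.5, "fuel cell": 0.2}
focused = focus_terms(cpc_scores, background_scores)
print(focused)
```

Here 'control unit' is removed because it is prominent in patents generally, while 'fuel cell' survives because it is rare in the background data.
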

### Choose focus source

This selects the set of patents used by the term focus option, for example to use a larger dataset:

```
python detect.py -fs=USPTO-random-100000
```

### Config files

There are three configuration files available inside the config directory:

- stopwords_glob.txt
- stopwords_n.txt
- stopwords_uni.txt

The first file (stopwords_glob.txt) contains stopwords that are applied to all n-grams.
The second file (stopwords_n.txt) contains stopwords that are applied to all n-grams for n>1, and the last file (stopwords_uni.txt) contains stopwords that apply only to unigrams. Users can append stopwords to these files to suppress undesirable output terms.

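The interaction of the three lists can be sketched as a single filter function. The sketch assumes that a term is rejected when any one of its words is a stopword, and it uses in-memory sets in place of the config files; both choices, along with the name `keep_term`, are assumptions and not the tool's confirmed behaviour.

```python
def keep_term(term, glob_stop, ngram_stop, uni_stop):
    """Apply global, n>1, and unigram-only stopword lists to one term."""
    words = term.split()
    if any(word in glob_stop for word in words):
        return False  # global stopwords apply to every n-gram
    if len(words) == 1:
        return words[0] not in uni_stop  # unigram-only list
    return not any(word in ngram_stop for word in words)  # n > 1 list


# Invented stand-ins for stopwords_glob.txt, stopwords_n.txt, stopwords_uni.txt.
glob_stop = {"the", "of"}
ngram_stop = {"said"}
uni_stop = {"method"}

candidates = ["fuel cell", "said cell", "method", "method step", "cell of", "engine"]
kept = [term for term in candidates if keep_term(term, glob_stop, ngram_stop, uni_stop)]
print(kept)
```

In this example 'method' is removed as a unigram but 'method step' is kept, which shows why a separate unigram list is useful.
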
## Help

A help function details the range and usage of these command line arguments:
```
python detect.py -h
```

An edited version of the help output is included below. This starts with a summary of arguments:

```
python detect.py -h
usage: detect.py [-h] [-f] [-c] [-t] [-p {median,max,sum,avg}]
                 [-o {fdg,wordcloud,report,all}] [-yf YEAR_FROM] [-yt YEAR_TO]
                 [-np NUM_NGRAMS_REPORT] [-nd NUM_NGRAMS_WORDCLOUD]
                 [-nf NUM_NGRAMS_FDG] [-ps PATENT_SOURCE] [-fs FOCUS_SOURCE]
                 [-mn {1,2,3}] [-mx {1,2,3}] [-rn REPORT_NAME]
                 [-wn WORDCLOUD_NAME] [-wt WORDCLOUD_TITLE]
                 [-cpc CPC_CLASSIFICATION]

create report, wordcloud, and fdg graph for patent texts
```
It continues with a detailed description of the arguments:
```
optional arguments:
  -h, --help            show this help message and exit
  -f, --focus           clean output from terms that appear in general
  -c, --cite            weight terms by citations
  -t, --time            weight terms by time
  -p {median,max,sum,avg}, --pick {median,max,sum,avg}
                        options are <median> <max> <sum> <avg> defaults to
                        sum. Average is over non zero values
  -o {fdg,wordcloud,report,all}, --output {fdg,wordcloud,report,all}
                        options are: <fdg> <wordcloud> <report> <all>
  -yf YEAR_FROM, --year_from YEAR_FROM
                        The first year for the patent cohort
  -yt YEAR_TO, --year_to YEAR_TO
                        The last year for the patent cohort (0 is now)
  -np NUM_NGRAMS_REPORT, --num_ngrams_report NUM_NGRAMS_REPORT
                        number of ngrams to return for report
  -nd NUM_NGRAMS_WORDCLOUD, --num_ngrams_wordcloud NUM_NGRAMS_WORDCLOUD
                        number of ngrams to return for wordcloud
  -nf NUM_NGRAMS_FDG, --num_ngrams_fdg NUM_NGRAMS_FDG
                        number of ngrams to return for fdg graph
  -ps PATENT_SOURCE, --patent_source PATENT_SOURCE
                        the patent source to process
  -fs FOCUS_SOURCE, --focus_source FOCUS_SOURCE
                        the patent source for the focus function
  -mn {1,2,3}, --min_n {1,2,3}
                        the minimum ngram value
  -mx {1,2,3}, --max_n {1,2,3}
                        the maximum ngram value
  -rn REPORT_NAME, --report_name REPORT_NAME
                        report filename
  -wn WORDCLOUD_NAME, --wordcloud_name WORDCLOUD_NAME
                        wordcloud filename
  -wt WORDCLOUD_TITLE, --wordcloud_title WORDCLOUD_TITLE
                        wordcloud title
  -cpc CPC_CLASSIFICATION, --cpc_classification CPC_CLASSIFICATION
                        the desired cpc classification
```
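For readers who want to see how a command line like this is typically declared, the sketch below rebuilds a subset of the options with Python's standard `argparse` module. Option names, choices, and help strings are copied from the help output above, and the defaults for `-o`, `-p`, `-ps`, `-fs`, `-mn` and `-mx` follow this README. The defaults for `-yf`, `-yt` and `-cpc` are assumptions, and the function name `build_parser` is hypothetical; this is not the contents of `detect.py`.

```python
import argparse


def build_parser():
    """Build a parser mirroring part of the documented detect.py interface."""
    parser = argparse.ArgumentParser(
        prog="detect.py",
        description="create report, wordcloud, and fdg graph for patent texts",
    )
    parser.add_argument("-f", "--focus", action="store_true",
                        help="clean output from terms that appear in general")
    parser.add_argument("-c", "--cite", action="store_true",
                        help="weight terms by citations")
    parser.add_argument("-t", "--time", action="store_true",
                        help="weight terms by time")
    parser.add_argument("-p", "--pick", default="sum",
                        choices=["median", "max", "sum", "avg"])
    parser.add_argument("-o", "--output", default="report",
                        choices=["fdg", "wordcloud", "report", "all"])
    # Defaults for the next three options are assumptions, not documented values.
    parser.add_argument("-yf", "--year_from", type=int, default=None,
                        help="The first year for the patent cohort")
    parser.add_argument("-yt", "--year_to", type=int, default=0,
                        help="The last year for the patent cohort (0 is now)")
    parser.add_argument("-cpc", "--cpc_classification", default=None,
                        help="the desired cpc classification")
    parser.add_argument("-ps", "--patent_source", default="USPTO-random-1000",
                        help="the patent source to process")
    parser.add_argument("-fs", "--focus_source", default="USPTO-random-10000",
                        help="the patent source for the focus function")
    parser.add_argument("-mn", "--min_n", type=int, choices=[1, 2, 3], default=2,
                        help="the minimum ngram value")
    parser.add_argument("-mx", "--max_n", type=int, choices=[1, 2, 3], default=3,
                        help="the maximum ngram value")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["-cpc=Y02", "-ps=USPTO-random-10000", "-t"])
print(args.cpc_classification, args.patent_source, args.output, args.time)
```

Passing an explicit list to `parse_args` keeps the example independent of `sys.argv`; a real script would call `parse_args()` with no arguments.
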