First release attempt (#18)

v0.1 pre-release

* Minimal code (#2)
* Setting up github... added requirements.txt to enable dependency tree
* First CI test (#3)
* Minimal set of files to get tests passing #1
* Config to trigger travis
* Remaining code (#7)
* Uses setup.py (#10)
* Corrected license
* bug: backend matplotlib so that it works with Pycharm. Fixes issue #12. (#13)
* feat: now shows the number of patents analysed for cpc classification
* feat: updated ReadME. Uploaded outputs for ReadME. Also moved fdg outputs to outputs/fdg folder not fdg folder in root directory (cleaner)
* Experimenting with code coverage #9 (#17)
datasciencecampus · Aug 21, 2018 · b8a511a
1 parent 378c8cd

Showing 51 changed files with 3,233 additions and 4 deletions.
diff --git a/.coveragerc b/.coveragerc
@@ -0,0 +1,13 @@
+[run]
+branch = True
+source = scripts
+
+[report]
+exclude_lines =
+    if self.debug:
+    pragma: no cover
+    raise NotImplementedError
+    if __name__ == .__main__.:
+ignore_errors = True
+omit =
+    tests/*
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,35 @@
+---
+name: Bug report
+about: Create a report to help us improve
+
+---
+
+**Describe the bug**
+A clear and concise description of what the bug is.
+
+**To Reproduce**
+Steps to reproduce the behavior:
+1. Go to '...'
+2. Click on '....'
+3. Scroll down to '....'
+4. See error
+
+**Expected behavior**
+A clear and concise description of what you expected to happen.
+
+**Screenshots**
+If applicable, add screenshots to help explain your problem.
+
+**Desktop (please complete the following information):**
+ - OS: [e.g. iOS]
+ - Browser [e.g. chrome, safari]
+ - Version [e.g. 22]
+
+**Smartphone (please complete the following information):**
+ - Device: [e.g. iPhone6]
+ - OS: [e.g. iOS8.1]
+ - Browser [e.g. stock browser, safari]
+ - Version [e.g. 22]
+
+**Additional context**
+Add any other context about the problem here.
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,17 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+
+---
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered.
+
+**Additional context**
+Add any other context or screenshots about the feature request here.
diff --git a/.gitignore b/.gitignore
@@ -102,3 +102,6 @@ venv.bak/
 
 # mypy
 .mypy_cache/
+
+# PyCharm
+.idea
diff --git a/.travis.yml b/.travis.yml
@@ -0,0 +1,18 @@
+language: python
+python:
+  - "3.6"
+
+install:
+  # command to install dependencies
+  - python setup.py install
+  # also need to download punkt tokeniser data
+  - travis_wait 30 python -m nltk.downloader punkt
+
+script:
+  # for codecov support
+  - pip install pytest pytest-cov
+  # command to run tests
+  - pytest --cov=./
+
+after_success:
+  - bash <(curl -s https://codecov.io/bash)
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,8 @@
+The Open Government Licence (OGL) Version 3
+
+Copyright (c) 2018 Office of National Statistics
+
+This source code is licensed under the Open Government Licence v3.0. To view this
+licence, visit www.nationalarchives.gov.uk/doc/open-government-licence/version/3
+or write to the Information Policy Team, The National Archives, Kew, Richmond,
+Surrey, TW9 4DU.
diff --git a/README.md b/README.md
@@ -1,2 +1,218 @@
-# patent_app_detect
-Derives popular terminology included within a particular patent technology area (CPC classification), based on text analysis of patent abstract information
+[![build status](http://img.shields.io/travis/datasciencecampus/patent_app_detect/master.svg?style=flat)](https://travis-ci.org/datasciencecampus/patent_app_detect)
+[![codecov](https://codecov.io/gh/datasciencecampus/patent_app_detect/branch/master/graph/badge.svg)](https://codecov.io/gh/datasciencecampus/patent_app_detect)
+[![LICENSE.](https://img.shields.io/badge/license-OGL--3-blue.svg?style=flat)](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
+
+# patent_app_detect 
+
+## Description of tool
+
+The tool is designed to derive popular terminology included within a particular patent technology area ([CPC classification](https://www.epo.org/searching-for-patents/helpful-resources/first-time-here/classification/cpc.html)), based on text analysis of patent abstract information.  If the tool is targeted at the [Y02 classification](https://www.epo.org/news-issues/issues/classification/classification.html), for example, identified terms could include 'fuel cell' and 'heat exchanger'. A number of options are provided, for example to provide report, word cloud or graphical output. Some example outputs are shown below:
+
+### Report
+
+The score here is derived from the term [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) values using the Y02 classification on a random sample of 10,000 patents. The terms are all bigrams in this example.
+
+|Term                       |        TF-IDF Score |
+| :------------------------ | -------------------:|
+|1. fuel cell               |       2.143778      |
+|2. heat exchanger          |       1.697166      |
+|3. exhaust gas             |       1.496812      |
+|4. combustion engine       |       1.480615      |
+|5. combustion chamber      |       1.390726      |
+|6. energy storage          |       1.302651      |
+|7. internal combustion     |       1.108040      |
+|8. positive electrode      |       1.100686      |
+|9. carbon dioxide          |       1.092638      |
+|10. control unit           |       1.069478      |
+
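As a rough illustration of how a ranking like the one above can be produced, the sketch below scores bigrams with a plain tf-idf and sums each term's score over all documents (the help text further down states that `--pick` defaults to `sum`). This is not the tool's implementation: the function names, the toy abstracts, and the exact tf-idf variant (term frequency times `log(N / df)`) are assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the list of adjacent word pairs in a lower-cased text."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def summed_tfidf(documents):
    """Sum a simple tf-idf score for every bigram over all documents.

    Assumed variant: tf = count / bigrams in document, idf = log(N / df).
    """
    doc_terms = [Counter(bigrams(doc)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for terms in doc_terms for term in terms)
    scores = Counter()
    for terms in doc_terms:
        total = sum(terms.values())
        for term, count in terms.items():
            scores[term] += (count / total) * math.log(n_docs / doc_freq[term])
    return scores


# Hypothetical abstracts, for illustration only.
abstracts = [
    "a fuel cell stack with a heat exchanger",
    "the fuel cell uses a polymer membrane",
    "an exhaust gas sensor for an engine",
]
# Highest-scoring terms first, as in the report output.
ranked = sorted(summed_tfidf(abstracts).items(), key=lambda kv: kv[1], reverse=True)
```

With so few documents the scores are not meaningful. The point is only the shape of the calculation: per-document scores are combined into one figure per term and then sorted.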
+### Word cloud
+
+Here is a [wordcloud](https://github.com/datasciencecampus/patent_app_detect/output/wordclouds/wordcloud_tech.png) using the Y02 classification on a 10,000 random sample of patents. The greater the tf-idf score, the larger the font size of the term.
+
+### Force directed graph
+
+This output provides an [interactive graph](https://github.com/datasciencecampus/patent_app_detect/outputs/fdg/index.html) that shows connections between terms that are generally found in the same patent documents. This example was run for the Y02 classification on a 10,000 random sample of patents.
+
+## How to install
+
+The tool has been developed to work on both Windows and macOS. To install:
+
+1. Please make sure Python 3.6 is installed and on your path.  
+   It can be installed from [this location](https://www.python.org/downloads/release/python-360/) by selecting the *relevant installer for your operating system*. When prompted, please check the box to set the paths and environment variables for you and you should be ready to go. Python can also be installed as part of Anaconda [here](https://www.anaconda.com/download/#macos).
+
+   To check the Python version default for your system, run the following in command line/terminal:
+
+   ```
+   python --version
+   ```
+
+   **_Note_**: If Python 2 is the default Python version, but you have installed Python 3.6, your path may be set up to use `python3` instead of `python`.
+
+2. To install the packages and dependencies for the tool, from the root directory (patent_app_detect) run:
+   ``` 
+   pip install -e .
+   ```
+   This will install all the libraries and run some tests. If the tests pass, the app is ready to run. If any of the tests fail, please email thanasis.anthopoulos@ons.gov.uk or ian.grimstead@ons.gov.uk
+   with a screenshot of the failure and we will get back to you.
+
+## How to use
+
+The program is command line driven, and called in the following manner:
+
+```
+python detect.py
+```
+
+The above produces a default report output of top ranked terms, using default parameters. Additional command line arguments provide alternative options, for example a word cloud or force directed graph (fdg) output. The option 'all' produces all three:
+
+```
+python detect.py -o='report'   # same as running `python detect.py` with no options
+python detect.py -o='wordcloud'
+python detect.py -o='fdg'
+python detect.py -o='all'
+```
+
+### Choosing patent source
+
+This selects the set of patents for use during analysis. The default source is a pre-created random 1,000 patent dataset from the USPTO, `USPTO-random-1000`. Pre-created datasets of 100, 1,000, 10,000, 100,000, and 500,000 patents are available in the `./data` folder. For example, the following runs the tool on a pre-created random dataset of 10,000 patents:
+
+```
+python detect.py -ps=USPTO-random-10000
+```
+
+### Choosing CPC classification
+
+This subsets the chosen patent dataset to a particular Cooperative Patent Classification (CPC) class, for example Y02. The Y02 classification is for "technologies or applications for mitigation or adaptation against climate change". In this case a larger patent dataset is generally required to allow for the reduction in patent numbers after subsetting. An example command is:
+
+```
+python detect.py -cpc=Y02 -ps=USPTO-random-10000
+```
+
+The number of patents in the subset is printed to the console. For example, for `python detect.py -cpc=Y02 -ps=USPTO-random-10000` the number of Y02 patents is 197, so the tf-idf is run on 197 patents.
+
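The subsetting step can be pictured with the sketch below. It assumes each patent record carries a list of CPC codes and keeps the records where any code starts with the requested class. The record layout, the `cpc_codes` field name, and the function name are hypothetical, and the tool's own data format may differ.

```python
def subset_by_cpc(patents, cpc_prefix):
    """Keep patents with at least one CPC code starting with cpc_prefix."""
    return [
        patent for patent in patents
        if any(code.startswith(cpc_prefix) for code in patent["cpc_codes"])
    ]


# Hypothetical records, for illustration only.
patents = [
    {"id": "A", "cpc_codes": ["Y02E60/50", "H01M8/04"]},
    {"id": "B", "cpc_codes": ["G06F17/30"]},
]
subset = subset_by_cpc(patents, "Y02")
print(f"{len(subset)} of {len(patents)} patents match Y02")
```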
+
+### Term n-gram limits
+
+Terms identified may be unigrams, bigrams, or trigrams. The following arguments limit terms to n-grams of 2 to 3 words, which are the default values.
+```
+python detect.py -mn=2 -mx=3
+```
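To make the effect of the two limits concrete, the sketch below lists every run of consecutive words whose length lies between a minimum and a maximum. The function name and the whitespace tokenisation are assumptions for illustration. The tool itself may tokenise differently, for example with the NLTK punkt data that the CI configuration downloads.

```python
def extract_ngrams(text, min_n=2, max_n=3):
    """Return every run of min_n..max_n consecutive words in the text."""
    words = text.lower().split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams


terms = extract_ngrams("internal combustion engine control", min_n=2, max_n=3)
```

Here `terms` holds three bigrams followed by two trigrams. Setting `min_n=1` would add the four single words as well.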
+
+### Time limits
+This restricts the patent cohort to patents from 2000 up to now.
+
+```
+python detect.py -yf=2000
+```
+
+This restricts the patent cohort to patents from 2000 to 2016.
+
+```
+python detect.py -yf=2000 -yt=2016
+```
+### Time weighting
+
+This option applies a linear weight that rises from 0.01 at the start of the time limits to 1 at the end.
+```
+python detect.py -t
+```
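A linear weight of this kind can be sketched as a straight-line interpolation between the two time limits. The function name and the choice to weight by whole years are assumptions. The tool may interpolate over exact publication dates instead.

```python
def time_weight(year, year_from, year_to, start=0.01, end=1.0):
    """Linearly interpolate from `start` at year_from to `end` at year_to."""
    if year_to == year_from:
        # A single-year window has no slope, so use the final weight.
        return end
    fraction = (year - year_from) / (year_to - year_from)
    return start + fraction * (end - start)


# Weights for the start, middle and end of a 2000-2016 window.
weights = [time_weight(y, 2000, 2016) for y in (2000, 2008, 2016)]
```

Under these assumptions a patent from the middle of the window receives a weight of about 0.505, so recent patents contribute roughly twice as much as mid-window ones.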
+
+### Citation weighting
+
+This weights the term tf-idf scores by the number of citations each patent has. The weight is normalised to a value between 0 and 1, where a higher value indicates a higher number of citations.
+
+```
+python detect.py -c
+```
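One simple way to obtain a weight between 0 and 1 is to divide each patent's citation count by the largest count in the dataset, as sketched below. The normalisation method and the function name are assumptions. The README only states that the weight is normalised to that range.

```python
def citation_weights(citation_counts):
    """Scale citation counts into [0, 1] by dividing by the largest count."""
    largest = max(citation_counts, default=0)
    if largest == 0:
        # No citations anywhere: every patent gets a zero weight.
        return [0.0 for _ in citation_counts]
    return [count / largest for count in citation_counts]


# Hypothetical citation counts for three patents.
weights = citation_weights([0, 5, 20])
```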
+
+### Term focus
+
+This option utilises a second random patent dataset, by default `USPTO-random-10000`, whose terms are discounted from the chosen CPC classification to try and 'focus' the identified terms away from terms found more generally in the patent dataset. An example of choosing a larger 
+
+```
+python detect.py -f
+```
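The idea of focusing can be sketched as filtering the CPC terms against the terms that are prominent in the general dataset. Treating "discounted" as outright removal is an assumption, as are the function name and the example terms. The tool may instead reduce the scores of shared terms.

```python
def focus_terms(cpc_terms, general_terms):
    """Keep CPC terms, in order, that are not prominent in the general set."""
    general = set(general_terms)
    return [term for term in cpc_terms if term not in general]


# Hypothetical term lists, for illustration only.
focused = focus_terms(
    ["fuel cell", "control unit", "heat exchanger"],
    ["control unit", "present invention"],
)
```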
+
+### Choose focus source
+
+This selects the set of patents for use during the term focus option, for example for a larger dataset.
+
+```
+python detect.py -fs=USPTO-random-100000
+```
+
+### Config files
+
+There are three configuration files available inside the config directory:
+
+- stopwords_glob.txt
+- stopwords_n.txt
+- stopwords_uni.txt
+
+The first file (stopwords_glob.txt) contains stopwords that are applied to all n-grams.
+The second file (stopwords_n.txt) contains stopwords that are applied to all n-grams for n>1, and the last file (stopwords_uni.txt) contains stopwords that apply only to unigrams. Users can append stopwords to these files to suppress undesirable output terms.
+
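The way the three lists combine can be sketched as follows. The global list applies to every n-gram, and one of the other two applies depending on the n-gram's length. Rejecting an n-gram when any single word in it is a stopword is an assumption, as are the function name and the example stopwords.

```python
def keep_ngram(ngram, glob_stop, n_stop, uni_stop):
    """Return True if the n-gram survives the three stopword lists."""
    words = ngram.lower().split()
    # Global stopwords apply to n-grams of any length.
    blocked = set(glob_stop)
    # Unigrams use the unigram list; longer n-grams use the n>1 list.
    blocked |= set(uni_stop) if len(words) == 1 else set(n_stop)
    return not any(word in blocked for word in words)


# Hypothetical stopword lists standing in for the three config files.
glob_stop, n_stop, uni_stop = {"said"}, {"first"}, {"method"}
kept = [
    term for term in ["fuel cell", "first electrode", "method", "said device"]
    if keep_ngram(term, glob_stop, n_stop, uni_stop)
]
```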
+## Help
+
+A help function details the range and usage of these command line arguments:
+```
+python detect.py -h
+```
+
+An edited version of the help output is included below. This starts with a summary of arguments:
+
+```
+python detect.py -h
+usage: detect.py [-h] [-f] [-c] [-t] [-p {median,max,sum,avg}]
+                 [-o {fdg,wordcloud,report,all}] [-yf YEAR_FROM] [-yt YEAR_TO]
+                 [-np NUM_NGRAMS_REPORT] [-nd NUM_NGRAMS_WORDCLOUD]
+                 [-nf NUM_NGRAMS_FDG] [-ps PATENT_SOURCE] [-fs FOCUS_SOURCE]
+                 [-mn {1,2,3}] [-mx {1,2,3}] [-rn REPORT_NAME]
+                 [-wn WORDCLOUD_NAME] [-wt WORDCLOUD_TITLE]
+                 [-cpc CPC_CLASSIFICATION]
+
+create report, wordcloud, and fdg graph for patent texts
+
+```
+It continues with a detailed description of the arguments:
+```
+optional arguments:
+  -h, --help            show this help message and exit
+  -f, --focus           clean output from terms that appear in general
+  -c, --cite            weight terms by citations
+  -t, --time            weight terms by time
+  -p {median,max,sum,avg}, --pick {median,max,sum,avg}
+                        options are <median> <max> <sum> <avg> defaults to
+                        sum. Average is over non zero values
+  -o {fdg,wordcloud,report,all}, --output {fdg,wordcloud,report,all}
+                        options are: <fdg> <wordcloud> <report> <all>
+  -yf YEAR_FROM, --year_from YEAR_FROM
+                        The first year for the patent cohort
+  -yt YEAR_TO, --year_to YEAR_TO
+                        The last year for the patent cohort (0 is now)
+  -np NUM_NGRAMS_REPORT, --num_ngrams_report NUM_NGRAMS_REPORT
+                        number of ngrams to return for report
+  -nd NUM_NGRAMS_WORDCLOUD, --num_ngrams_wordcloud NUM_NGRAMS_WORDCLOUD
+                        number of ngrams to return for wordcloud
+  -nf NUM_NGRAMS_FDG, --num_ngrams_fdg NUM_NGRAMS_FDG
+                        number of ngrams to return for fdg graph
+  -ps PATENT_SOURCE, --patent_source PATENT_SOURCE
+                        the patent source to process
+  -fs FOCUS_SOURCE, --focus_source FOCUS_SOURCE
+                        the patent source for the focus function
+  -mn {1,2,3}, --min_n {1,2,3}
+                        the minimum ngram value
+  -mx {1,2,3}, --max_n {1,2,3}
+                        the maximum ngram value
+  -rn REPORT_NAME, --report_name REPORT_NAME
+                        report filename
+  -wn WORDCLOUD_NAME, --wordcloud_name WORDCLOUD_NAME
+                        wordcloud filename
+  -wt WORDCLOUD_TITLE, --wordcloud_title WORDCLOUD_TITLE
+                        wordcloud title
+  -cpc CPC_CLASSIFICATION, --cpc_classification CPC_CLASSIFICATION
+                        the desired cpc classification
+
+```
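Help output in this form is what Python's standard `argparse` module generates. The sketch below rebuilds a subset of the options listed above to show how the `-o=fdg` style of invocation is parsed. It is not the contents of `detect.py`. The defaults for `--output`, `--min_n` and `--max_n` follow the text of this README, while the default of 0 for `--year_to` is an assumption based on the "(0 is now)" note.

```python
import argparse


def build_parser():
    """Build a parser covering a subset of the options listed above."""
    parser = argparse.ArgumentParser(
        prog="detect.py",
        description="create report, wordcloud, and fdg graph for patent texts",
    )
    parser.add_argument("-f", "--focus", action="store_true",
                        help="clean output from terms that appear in general")
    parser.add_argument("-o", "--output", default="report",
                        choices=["fdg", "wordcloud", "report", "all"])
    parser.add_argument("-yf", "--year_from", type=int,
                        help="The first year for the patent cohort")
    # Assumed default: 0, which the help text describes as "now".
    parser.add_argument("-yt", "--year_to", type=int, default=0,
                        help="The last year for the patent cohort (0 is now)")
    parser.add_argument("-mn", "--min_n", type=int, choices=[1, 2, 3], default=2)
    parser.add_argument("-mx", "--max_n", type=int, choices=[1, 2, 3], default=3)
    parser.add_argument("-cpc", "--cpc_classification",
                        help="the desired cpc classification")
    return parser


# Equivalent to: python detect.py -o=fdg -yf=2000 -cpc=Y02
args = build_parser().parse_args(["-o=fdg", "-yf=2000", "-cpc=Y02"])
```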