Updated the README.md, added notebook and visualizations #44

Merged
merged 22 commits into from
May 11, 2020
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
.DS_Store

environ.yaml

# ignore data folder by default
data/

451 changes: 451 additions & 0 deletions analysis/db/chrissi/experimenting_APIs.ipynb

Large diffs are not rendered by default.

1,223 changes: 1,223 additions & 0 deletions analysis/db/mitch/Air_Quality_Viz.ipynb

Large diffs are not rendered by default.

35 changes: 35 additions & 0 deletions analysis/db/mitch/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
-What does this figure show?

This figure shows the changes in NO2 concentrations for Milan, Italy and Wuhan, China.

-Why did you choose to look at the values shown in the figure?

Largely because NO2 is a commonly used indicator of air quality, and it was
the pollutant with the most data available in my dataset.

-Why should someone care about these results?

Air pollution, especially from gases like nitrogen dioxide, is a major cause of respiratory illness
and death in many cities. Although the recent improvement is rooted in unfortunate circumstances, cleaner air in cities
helps save lives. Air quality levels can also be an indicator of economic activity and of how different governments
reacted to the pandemic.

-What are the limitations/caveats to your data analysis?

-What decisions did you make in the data processing that could under/overestimate your results

My visualizations do not start very far back in time. To get a fuller view of
air quality changes from COVID-19, I should have started at least a full year earlier to account for other
potential factors such as seasonal changes.
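One way to account for seasonality is to compare a date window against the same window one year earlier. The snippet below is only a minimal sketch of that idea, not the method used in the notebook: the function names `window_mean` and `year_over_year_change` are hypothetical, and the readings are made-up numbers for illustration.

```python
from datetime import date
from statistics import mean


def window_mean(readings, start, end):
    """Mean of the readings whose date falls in [start, end]."""
    values = [value for day, value in readings if start <= day <= end]
    if not values:
        raise ValueError(f"no readings between {start} and {end}")
    return mean(values)


def year_over_year_change(readings, start, end):
    """Percent change of a window's mean versus the same window one year earlier.

    Note: date.replace raises ValueError if start or end is 29 February.
    """
    current = window_mean(readings, start, end)
    baseline = window_mean(
        readings,
        start.replace(year=start.year - 1),
        end.replace(year=end.year - 1),
    )
    return 100.0 * (current - baseline) / baseline


# Made-up NO2 readings (micrograms per cubic metre), NOT real measurements.
readings = [
    (date(2019, 3, 1), 50.0),
    (date(2019, 3, 15), 54.0),
    (date(2019, 3, 30), 46.0),
    (date(2020, 3, 1), 32.0),
    (date(2020, 3, 15), 28.0),
    (date(2020, 3, 30), 30.0),
]

change = year_over_year_change(readings, date(2020, 3, 1), date(2020, 3, 31))
print(f"NO2 change vs. the same window one year earlier: {change:.1f}%")
```

Comparing against the same calendar window keeps the seasonal pattern roughly constant, so the remaining difference is less likely to be a seasonal effect. It still does not isolate the effect of the quarantines from other factors.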

-What external factors could affect your results?

Recent NO2 changes may not have been caused solely by COVID-19 quarantines. Many other variables
can play a role in air quality. For example, some areas may have a high concentration of essential businesses
whose operations require burning fossil fuels and therefore keep producing emissions.

-Another way to think about this: if you were to present your findings to the public and people were going to take your
word to make some actionable decision, what would you tell them could skew the results and affect the decision making?

I only measured the concentration of one substance. There are many other pollutants to measure, such as SO2, CO, CO2, and particulate matter.

128 changes: 128 additions & 0 deletions analysis/db/somya/riskFactorsCount.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,128 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"\n",
"import pandas as pd\n",
"from nltk.stem import PorterStemmer\n",
"import matplotlib.pyplot as plt\n",
"from pyprojroot import here\n",
"\n",
"# load the tab separated file\n",
"df = pd.read_csv(here(\"./data/db/final/kaggle/paper_text/comm_use_subset_pdf_json.tsv\"), sep=\"\\t\")\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# define the risk-factor keywords\n",
"targets = {\"smoke\", \"diabetes\", \"neonates\", \"pregnancy\",\n",
"           \"pregnant\", \"heart\", \"co-infection\", \"coinfection\", \"comorbidity\"}\n",
"\n",
"word_stemmer = PorterStemmer()\n",
"# NOTE: `map` shadows the Python builtin; the name is kept because later cells use it\n",
"map = {}\n",
"# stem each target term so it also matches inflected forms; start every count at 0\n",
"for term in targets:\n",
"    stemmedTerm = word_stemmer.stem(term)\n",
"    map[stemmedTerm] = 0"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'heart': 1054, 'smoke': 256, 'diabet': 706, 'neonat': 579, 'coinfect': 529, 'pregnant': 496, 'pregnanc': 407, 'co-infect': 835, 'comorbid': 412}\n"
]
}
],
"source": [
"# count the papers whose text contains each stemmed term\n",
"# (case-sensitive substring match, so each paper counts at most once per term)\n",
"text_data = df['text']\n",
"for text in text_data:\n",
"    for word in map:\n",
"        if word in text:\n",
"            map[word] += 1\n",
"print(map)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# group the per-term paper counts into the categories to plot\n",
"# NOTE: the coinfection bar adds three separate counts, so a paper that\n",
"# contains more than one of these terms is counted more than once\n",
"to_graph = {\"heart disease\": map['heart'], \"pregnancy\": map['pregnanc'],\n",
"            \"diabetes\": map[\"diabet\"], \"smoking\": map[\"smoke\"],\n",
"            \"coinfection\": map['coinfect']+map['co-infect']+map['comorbid']}\n",
"\n",
"keys = to_graph.keys()\n",
"values = to_graph.values()\n",
"\n",
"plt.figure()\n",
"plt.bar(keys, values)\n",
"plt.title(\"Frequency of Risk Factors\")\n",
"plt.xlabel(\"Risk Factors\")\n",
"plt.ylabel(\"Frequency\")\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
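The notebook above counts a paper whenever a stemmed term appears anywhere in its text as a case-sensitive substring, and the coinfection bar adds separate per-term counts. A hedged alternative is sketched below; it is not the notebook author's method. It swaps the NLTK stemmer for a plain regular expression that matches words starting with a stem, ignores case, skips missing text values, and counts each paper at most once per category. The `GROUPS` mapping, the function names, and the sample texts are all hypothetical.

```python
import re

# Hypothetical grouping of stems into plot categories (an assumption,
# not the grouping used in the notebook).
GROUPS = {
    "heart disease": ["heart"],
    "pregnancy": ["pregnanc", "pregnant"],
    "coinfection": ["coinfect", "co-infect"],
}


def compile_group(stems):
    """Build a regex matching any word that starts with one of the stems."""
    alternatives = "|".join(re.escape(stem) for stem in stems)
    return re.compile(rf"\b(?:{alternatives})\w*", re.IGNORECASE)


def count_documents(texts, groups):
    """Count, per category, the documents containing at least one matching word."""
    patterns = {name: compile_group(stems) for name, stems in groups.items()}
    counts = {name: 0 for name in groups}
    for text in texts:
        if not isinstance(text, str):  # skip missing values such as NaN
            continue
        for name, pattern in patterns.items():
            if pattern.search(text):  # one hit is enough: count the document once
                counts[name] += 1
    return counts


# Made-up sample texts for illustration only.
texts = [
    "Pregnant patients and pregnancy outcomes were reviewed.",
    "Co-infection and coinfection rates were similar.",
    "The sweetheart deal had nothing to do with cardiology.",
    float("nan"),
    "Heart failure was common.",
]

counts = count_documents(texts, GROUPS)
print(counts)
```

Because each category uses a single pattern, a paper mentioning both "coinfection" and "co-infection" adds one to that category rather than two, and "sweetheart" does not count as a match for "heart".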