In [1]:
#standard imports
import pandas as pd
import numpy as np

#API tools
import requests
import json
from pandas.io.json import json_normalize
from urllib.request import Request, urlopen
import json

#visuals
import matplotlib as plt
import seaborn as sns

#Natural Language Processing
import nltk
import lda #Latent Dirichlet Allocation (create topics)
import gensim
from gensim import corpora, models #for constructing document term matrix
from nltk.stem.porter import PorterStemmer
from nltk import stem
from nltk.corpus import stopwords

#magic
%matplotlib inline

In [2]:
pd.set_option('display.float_format', lambda x: '%.3f' % x) #otherwise we have scientific notation

# Our Question: Among state and city government open datasets hosted on the Socrata portal, what dataset topics are the most popular as of August 2017?

## However, this (unfortunately) isn't as easy as choosing a consistent category like "crime and public safety" and adding up views and downloads. We have two obstacles:
- We need to control for the fact that some datasets just get more views because they are from popular portals. Our goal is determine what *types* of datasets are most popular so all cities and states can use this information. We don't want our conclusion to be that everyone should release data on "New York Lottery" (this is not a joke; our initial analysis found that to be a very popular tag).
- Datasets themselves vary widely, and not all metadata is standardized or consistent. Some datasets are in broad categories, like annual budget, and some are very niche and specific, like dog licenses. Compounding this challenge is that cities choose all sorts of different tags and categories for similar data; 

### Therefore, we need a way to extract true topics that cut across multiple cities' and states' datasets.
- We want "budget", "fiscal year" and "expenditures" to all be treated as a single topic.
- **Fortunately, a machine learning model exists to do just this.** *Latent Dirichlet Allocation* is an algorithm that finds "hidden" topics in a group of documents. It posits that each "document" (in this case, our combined metadata) is a mixture of topics, and that each word in a document is derived from a topic. LDA will produce "topics" composed of words and their expected distribution across that topic. While expected distribution is not very intuitive to interpret, the word combinations in topics is extremely easy to interpret.
    - **In order to run LDA on this dataset, we processed our metadata by combining "description", "domain_tags" and "categories" attributes into a new attribute called "mash".** Each row of "mash" will be a document in our model.
        - We excluded dataset name and domain; these are often very specific, parochial tags that will inhibit our efforts to form "topics" that can apply across cities and datasets. As we will see later, proper and parochial names skew our model and we want to avoid them.

## Read in a saved dataframe created from Socrata API call on 8.14.17
- Unpacked JSON dicts returned by API call.
- Contains data on nearly 30,000 dataset's metadata, including description, topic tags, categories, views, and downloads.
- **This metadata is not in a standard format**. City and state governments write their own descriptions and choose their own tags. One government may tag their annual budget as "budget", while another may choose "fiscal year", and yet another just "government".
- **This metadata is often missing one or more attributes**. Some governments don't include a description, or don't include tags.

In [3]:
big_mash = pd.read_csv('big_mash_archive_8_14.14')

In [4]:
big_mash.head()

Unnamed: 0.1,Unnamed: 0,name,description,attribution,columns_field_name,columns_name,type,categories,domain_category,domain_tags,provenance,download_count,page_views_last_month,page_views_last_week,page_views_total,page_views_total_log,domain,mash,big_mash
0,0,Homelessness PIT Transitional Age Youth,[],,['location_on_the_night_of_the_count_total_per...,"['Location on the night of the count, Total Pe...",chart,[],,"['point in time', 'homelessness', 'ss', 'dchs']",official,4.0,46.0,2.0,319.0,8.322,dashboard.alexandriava.gov,"['point', 'in', 'time', 'homelessness', 'ss', ...","['point', 'in', 'time', 'homelessness', 'ss', ..."
1,1,Fair Housing Complaints,[],,"['violations', 'percent_found_to_be_compliant'...","['Number of complaints', 'Percent of sites fou...",chart,[],,['housing'],official,21.0,31.0,0.0,278.0,8.124,dashboard.alexandriava.gov,['housing'],['housing']
2,2,Parking Complaints Bar Chart,[],,['percent_of_valid_parking_meter_problem_servi...,['Percent of valid parking meter problem servi...,chart,['transportation'],,[],official,17.0,17.0,2.0,274.0,8.103,dashboard.alexandriava.gov,['transportation'],['transportation']
3,3,NVMHI Admissions,[],,"['lipos_admissions_per_100k', 'nvmhi_admission...","['LIPOS Admissions per 100K', 'NVMHI Admission...",chart,[],,"['delete', 'dchs']",official,5.0,38.0,1.0,263.0,8.044,dashboard.alexandriava.gov,"['delete', 'dchs']","['delete', 'dchs']"
4,4,Property Owners Trainined,"['Office', 'of', 'Housing', 'Data']",,"['number_of_property_owners_trained', 'percent...","['Number of property owners trained', 'Percent...",chart,[],,['housing'],official,11.0,30.0,0.0,246.0,7.948,dashboard.alexandriava.gov,['housing'],"['housing', 'Office', 'of', 'Housing', 'Data']"


In [5]:
len(big_mash.index)

21807

## Building an LDA Model:

- Running an LDA model is very computationally intensive as explained below; this is why we did it in a separate notebook. Fortunately, we can save and import models.
- When building an LDA model, we pick the number of topics that we want the algorithm to find. This is a very important parameter.
- LDA Model fitting is an iterative process. The algorithm starts out by assigning every word to a temporary topic. Then, for *every* word, it updates the topics by calculating:
    - How prevalent is that word across topics? Topics with a high prevalance of the word in question get a higher weight for that word's assignment.
     - How prevalent are topics within a document? If one topic within a document is more prevalent, it gets a higher weight.
     - Based on these two criteria, LDA then updates a word's topic and document assignment.
     - **An LDA's topic formations get better the more passes you can run. However, running 60 passes of this dataset's model (we do have a large number of documents, although their word count is relatively low) takes about 35 minutes on a MacBook Air with 8GB of RAM.** 

### If we understand nothing else, we need to understand this: LDA is a *probabilistic* model and its results can fluctuate based on small changes in the data and how the model itself randomly assigns topics to begin with!
- This is a weakness in our analysis; it is not very replicable. It took several different LDA formations to settle on an ideal number of topics and to get consistently coherent topics.
- Our source data changes based on our API call. Our LDA model changes based upon both the source data and the model's probabilistic results itself. **Thus, for stability and replicability, we've saved both a dataframe and an LDA model. These are snapshots in time, not current, definitive structures.**

## Read in saved LDA Model
- Constructed from 52 topics; 60 passes on the tokens from the "big_mash" column of our dataframe:
- The source code for this is in the ipynb "Original LDA Code - Socrata API Dataset - Gensim LDA Groupings". **Note that if you run this code yourself (encouraged), you WILL get different results, and perhaps not useful ones**. Finding the right LDA settings for your dataset is time consuming and basically takes trial and error. Again, this is a weakness of our model.

In [6]:
lda_52_sixty = gensim.models.ldamodel.LdaModel.load('lda_52_sixty_good_model')

In [7]:
lda_52_sixty.show_topics(num_topics=52, formatted=False)

[(0,
  [('prevention', 0.039228265770020823),
   ('hospital', 0.038899194905431915),
   ('ny', 0.034882850346069173),
   ('healthy', 0.032344049932466652),
   ('statewide', 0.029269391777562728),
   ('hospitals', 0.022294582283033028),
   ('inpatient', 0.021854135158083089),
   ('healthcare', 0.018663035226976647),
   ('api', 0.018299534915519441),
   ('quality-safety-costs', 0.016526700960621438)]),
 (1,
  [('chart', 0.069963659585634108),
   ('children', 0.059274004768666677),
   ('income', 0.03908399441304488),
   ('historic', 0.025093228857131812),
   ('home', 0.021852521187430106),
   ('assistance', 0.018091177706023875),
   ('homes', 0.017940795661987072),
   ('families', 0.017918354840225928),
   ('low', 0.017857714487521079),
   ('pay', 0.017610064270261648)]),
 (2,
  [('population', 0.13151520321707977),
   ('county', 0.091960434049722567),
   ('demographics', 0.084275005086619365),
   ('age', 0.0434468663431666),
   ('total', 0.028221558698175012),
   ('king', 0.0194688109256

## Analysis of LDA Model Topic Formation:

This is a very decent topic analysis. By my count, **we've created 48 genuinely useful "clusters" of latent topics.** Since topic analysis is probabilistic, and our data contains "noise" (that is, words either far too specific or far too common to add real meaning) to begin with, **we have a few topics that won't help us.**
- Topic 32 is just too vague. It probably has something to do with some open government performance metrics, but there are no words to really a distinguish it.  This is not a failure of the algorithm; I don't doubt these words really are occuring together alot in the data. But "open, created, items" doesn't tell us much.
- Topic 24 clearly has to do with some "special" software/gis for open data (probably having to do with nursing) -- it looks like through the openmichigan portal. Again, these words are quite likely appearing together. But the practical application of a topic like this when it comes to determinging what is popular in the real world is limited.
- Topic 42 has a similar problem to topics 32 and 42 topics, but is a little more clear for content type. It looks like political information for several years. But we can't really tell more specifics.
- Topic 43 is what I'd call a "parochial" topic; it's clearly about common information like jobs and licensing from new york and michigan. Again, not useful as its own topic - but my spin is that this helps isolate "new york" and "ny" to keep them from inflating other topic views.

**Other topics are useful, but may mix content. It could be a vagary of the English language, or it might actually reveal new insights:**
- Topic 51 is a topic that tells us something, but it appears to be a mix of youth court cases and youth college enrollment. This could be that "enrollment" is used in English in the context of college and court-mandated programs. Then again, this could be "at-risk youth outcome statistics - either court or college (or both).
- Topic 37 appears to be about the construction, permits, and financials of building projects; since these are are pretty closely related in real life, however, I'd argue this is a good topic formation.

**Many topics contain a proper name, but clearly identify something useful:**
- Topic 9 -- "recreation", "parks", "jersey", "park", "centers", "neighborhood", "centers", etc. is clearly about parks and recreation facilities. It just contains jersey. Because of how we'll award our views/downloads counts to each topic, "jersey" should only skew this slightly. Again, this is just "error" (but not really error, according to the model - jersey probably had a lot of parks and recreation datasets) we have to tolerate if we can't exclude proper names as stop words.

**Topic 13 has a seemingly weird outlier:**
- Topic 13 is clearly about gas & fuel emissions, but also contains "food". I'm guessing "food" is linked to "gas" by the word "natural" across many datasets (natural food, natural gas). This is just noise we will have to tolerate.

**Many topics are absolutely beautiful. A few examples:**
- Topic 11 -- energy, environment, electricity, air, sustainable, action, climate, city, clean, facilities -- identifies a city's environmental initiatives with words I wouldn't have even thought of to group together.
- Topic 31 -- politics, government, election, campaign, elections, commissions, results, etc -- leaves no doubt about its content.
- Topic 5 -- locations, bacteria, hours, culture, county, levels, contact, directory, e [as in e coli], contains -- is amazing. Words that could be all sorts of different topics that become so clear in context together. This is about local bacteria levels! (Presumably in lakes)

## Load Saved Corpus:
- We are going to use our LDA model to transform this corpus so we can return the *topic composition* of each *document*.

In [8]:
corpus = corpora.MmCorpus('8.14corpus.mm')

In [9]:
corpus_lda = lda_52_sixty[corpus] #just a wrapper -- will convert on the fly when you call it

**Let's check that this lines up:**

In [10]:
for doc in corpus_lda[20000:20003]:
    print(doc) 

[(31, 0.50961538461538491)]
[(31, 0.50961538461538491)]
[(31, 0.50961538461538214)]


In [11]:
big_mash.big_mash[20000:20003]

20000    ['politics']
20001    ['politics']
20002    ['politics']
Name: big_mash, dtype: object

In [12]:
lda_52_sixty.show_topic(31)

[('politics', 0.11069826014751941),
 ('government', 0.098315050573643273),
 ('election', 0.038259777750699074),
 ('campaign', 0.030948562034191236),
 ('elections', 0.029682305103395073),
 ('commission', 0.029314992871497855),
 ('results', 0.024928316671568231),
 ('city', 0.02132785699829063),
 ('460', 0.020080548927070931),
 ('finance', 0.019735216016406935)]

All three documents have only the tag 'politics" to idenfity them. As we can see, our corpus transformation identifies document 31 as composing .5 of all of them. Document 31 is a politics and elections topic. Looks good.

In [13]:
len(corpus_lda)

21807

In [14]:
corpus_lda_list = list(corpus_lda[0:21790]) 

For some reason we can't convert our entire corpus into a list. It is truly mind boggling, and the potential issue is beyond my level of knowledge. But since we can convert most of the corpus, we'll make a (slightly abridged) new dataframe and add our topic compositions:

In [15]:
stats = big_mash.copy()
stats = stats.iloc[0:21790, :]
stats = stats.assign(topic_comp = corpus_lda_list)
stats.iloc[9009:9012, :] #random slice

Unnamed: 0.1,Unnamed: 0,name,description,attribution,columns_field_name,columns_name,type,categories,domain_category,domain_tags,provenance,download_count,page_views_last_month,page_views_last_week,page_views_total,page_views_total_log,domain,mash,big_mash,topic_comp
9009,9009,Returning Citizen Child Support & Inmate Oblig...,"['Data', 'are', 'provided', 'by', 'the', 'Depa...",Department of Human Resources (DHR),['number_of_released_inmates_who_pay_any_perce...,['Number of released inmates who pay any perce...,dataset,[],Health and Human Services,"['inmate obligors', 'prisoner re-entry', 're-e...",official,1433.0,20.0,1.0,628.0,9.297,data.maryland.gov,"['inmate', 'obligors', 'prisoner', 're-entry',...","['inmate', 'obligors', 'prisoner', 're-entry',...","[(4, 0.210601353097), (6, 0.0182243633613), (1..."
9010,9010,"Breakfast, Lunch, and At-Risk Afterschool Meal...","['This', 'dataset', 'tracks', 'participation',...",Maryland State Department of Education (MSDE),['at_risk_afterschool_meals_program_average_da...,['At-Risk Afterschool Program Avg. Daily Parti...,chart,['education'],Health and Human Services,"['meal', 'education', 'lunch', 'breakfast', 'c...",official,115.0,26.0,2.0,627.0,9.295,data.maryland.gov,"['meal', 'education', 'lunch', 'breakfast', 'c...","['meal', 'education', 'lunch', 'breakfast', 'c...","[(1, 0.0776627218935), (2, 0.039201183432), (7..."
9011,9011,Total Light Rail Trips Taken by Year: Column C...,"['Data', 'are', 'provided', 'by', 'the', 'Mary...",Maryland Transit Authority,"['fiscal_year', 'light_rail']","['Fiscal Year', 'Light Rail']",chart,['transportation'],Transportation,"['ridership', 'paratransit', 'mobility', 'rail...",official,181.0,12.0,1.0,621.0,9.281,data.maryland.gov,"['ridership', 'paratransit', 'mobility', 'rail...","['ridership', 'paratransit', 'mobility', 'rail...","[(0, 0.0156918612058), (2, 0.0263660953474), (..."


### Let's poke around to see how our model's calculated topic compositions do:

In [16]:
big_mash.big_mash[900]

"['police', 'traffic', 'fatality', 'fatalities', 'public', 'safety', 'Dataset', 'of', 'traffic', 'fatalities', 'January', '1st', 'December', '31st', '2015', 'The', 'Austin', 'Police', 'Department', 'Fatality', 'database', 'contains', 'only', 'those', 'crashes', 'investigated', 'by', 'APD', 'and', 'is', 'continuously', 'being', 'updated', 'due', 'to', 'on', 'going', 'investigations', 'The', 'data', 'provided', 'here', 'represents', 'a', 'snapshot', 'of', 'Traffic', 'Fatality', 'information', 'at', 'a', 'specific', 'point', 'in', 'time', 'and', 'may', 'change', 'Due', 'to', 'the', 'long', 'processing', 'times', 'for', 'toxicology', 'testing', 'impairment', 'and', 'suspected', 'impairment', 'statistics', 'are', 'based', 'on', 'the', 'initial', 'assessment', 'of', 'the', 'Detectives', 'and', 'Medical', 'Examiner']"

This dataset is about traffic fatalities and crashes as recorded by the police department.

In [17]:
for index, score in sorted(lda_52_sixty[corpus[900]], key=lambda tup: -1*tup[1]): 
    print("Score: {}\t Topic: {} \n".format(score, lda_52_sixty.print_topic(index, 15))) #15 word topics

Score: 0.5480149120022656	 Topic: 0.071*"public" + 0.069*"safety" + 0.048*"fire" + 0.021*"month" + 0.017*"calls" + 0.017*"police" + 0.017*"emergency" + 0.016*"response" + 0.015*"discharge" + 0.014*"incident" + 0.014*"department" + 0.011*"incidents" + 0.011*"time" + 0.011*"discharges" + 0.011*"medical" 

Score: 0.3004543236048098	 Topic: 0.046*"information" + 0.025*"tab" + 0.024*"projects" + 0.018*"concerning" + 0.018*"may" + 0.017*"contains" + 0.016*"project" + 0.014*"provided" + 0.014*"additional" + 0.012*"improvement" + 0.010*"capital" + 0.009*"totals" + 0.009*"available" + 0.009*"release" + 0.008*"due" 

Score: 0.051468371213253206	 Topic: 0.096*"water" + 0.048*"environment" + 0.021*"protection" + 0.020*"tx" + 0.020*"filed" + 0.019*"waste" + 0.019*"site" + 0.019*"quality" + 0.017*"environmental" + 0.016*"monitoring" + 0.015*"activity" + 0.013*"facility" + 0.012*"recycling" + 0.012*"activities" + 0.011*"modified" 

Score: 0.025782322551716974	 Topic: 0.168*"public" + 0.140*"safety" +

And indeed, our highest topic affinity comes from a public safety calls/emergency response and medical topic. That's a good match.

However, we can see some potential pitfalls here. Several other topics register, including "government performance measurements" tag. This speaks to the need to set a threshold for awarding a topic "credit" -- i.e. it composes more than .2 of a topic, or it's the largest topic.

In [18]:
big_mash.big_mash[20789]

"['benefits', 'veterans', 'services', 'veterans', 'measure', 'a', 'infrastructure', 'health', '&', 'human', 'services', 'Quarterly', 'Measure', 'A', 'dashboard', 'data', 'for', 'Veterans', 'Services', 'Assessment', 'and', 'Staffing', 'initiative', 'Claims', 'data', 'sourced', 'from', 'VetPro', 'DVS', '19', 'reports;', 'Office', 'contacts', 'sources', 'from', 'Daily', 'Appointment', 'Logs']"

In [19]:
for index, score in sorted(lda_52_sixty[corpus[20789]], key=lambda tup: -1*tup[1]): 
    print("Score: {}\t Topic: {} \n".format(score, lda_52_sixty.print_topic(index, 15))) 

Score: 0.19399944234439653	 Topic: 0.080*"business" + 0.032*"license" + 0.022*"job" + 0.022*"contracts" + 0.020*"businesses" + 0.018*"vendor" + 0.018*"owned" + 0.018*"contract" + 0.017*"licenses" + 0.017*"certified" + 0.016*"economy" + 0.016*"city" + 0.016*"list" + 0.015*"quantities" + 0.015*"vendors" 

Score: 0.13777744063768205	 Topic: 0.049*"transportation" + 0.046*"plans" + 0.037*"iowa" + 0.032*"area" + 0.030*"transit" + 0.030*"operations" + 0.029*"region" + 0.024*"bus" + 0.020*"priority" + 0.016*"people" + 0.014*"plan" + 0.013*"areas" + 0.013*"(including" + 0.012*"two" + 0.012*"dot" 

Score: 0.13194214859348494	 Topic: 0.176*"development" + 0.152*"housing" + 0.059*"economic" + 0.049*"community" + 0.019*"infrastructure" + 0.018*"managed" + 0.018*"medicaid" + 0.014*"department" + 0.014*"buildings" + 0.012*"economy" + 0.012*"assistance" + 0.011*"program" + 0.010*"family" + 0.008*"benefits" + 0.008*"addressing" 

Score: 0.10411140583554371	 Topic: 0.305*"health" + 0.061*"san" + 0.059*

Here's a great example of **INHERENT** uncertainty in our model. A human eye can barely tell what the metadata on this dataset is saying -- it looks like veterans services, specifically health care -- as indicated by "Daily Appointment Logs". It also contains a random infrastructure tag and just several words that don't add much real life meaning. 

Our model tries to match it with a business licenses topic. I wouldn't say that's right - but look at the composition score. It's not even .2. **This speaks to the need to set a "cut off" to get credit for a topic's views. **

In [20]:
big_mash.big_mash[6718]

"['finance', 'health', 'Insurance', 'plan', 'premiums', 'available', 'to', 'Iowa', 'individuals', 'for', '2017', 'under', 'the', 'Affordable', 'Care', 'Act']"

In [21]:
for index, score in sorted(lda_52_sixty[corpus[6718]], key=lambda tup: -1*tup[1]): 
    print("Score: {}\t Topic: {} \n".format(score, lda_52_sixty.print_topic(index, 15))) 

Score: 0.3175787077909651	 Topic: 0.110*"care" + 0.049*"finance" + 0.044*"provider" + 0.042*"child" + 0.035*"cost" + 0.031*"plan" + 0.029*"day" + 0.025*"bay" + 0.024*"government" + 0.018*"fee" + 0.018*"franklin" + 0.018*"act" + 0.017*"effective" + 0.013*"administrative" + 0.012*"efficient" 

Score: 0.2573984964913682	 Topic: 0.075*"economy" + 0.059*"employment" + 0.038*"labor" + 0.030*"unemployment" + 0.023*"insurance" + 0.022*"credit" + 0.021*"workforce" + 0.018*"economic" + 0.017*"training" + 0.016*"federal" + 0.015*"industry" + 0.014*"prosperous" + 0.014*"wages" + 0.013*"work" + 0.012*"compensation" 

Score: 0.09159134428015532	 Topic: 0.049*"transportation" + 0.046*"plans" + 0.037*"iowa" + 0.032*"area" + 0.030*"transit" + 0.030*"operations" + 0.029*"region" + 0.024*"bus" + 0.020*"priority" + 0.016*"people" + 0.014*"plan" + 0.013*"areas" + 0.013*"(including" + 0.012*"two" + 0.012*"dot" 

Score: 0.08984170784776785	 Topic: 0.176*"development" + 0.152*"housing" + 0.059*"economic" + 0.

We haven't really created a "health insurance" topic tag. However, the top topic for this metadata is topic that tries to match provider information and finances - it's top words "care, finance, provider" -- actually match a real life dataset about ACA insurance plans pretty well! **This is a good example of the model not being perfect, but still adding value.**

**Also note** our second highest affinity comes from an economy and unemployment topic, which includes "insurance". However, I bet this is in the context of unemployment insurance!

In [22]:
big_mash.big_mash[18]

"['housing', 'Office', 'of', 'Housing', 'Data']"

In [23]:
for index, score in sorted(lda_52_sixty[corpus[18]], key=lambda tup: -1*tup[1]): 
    print("Score: {}\t Topic: {} \n".format(score, lda_52_sixty.print_topic(index, 15))) 

Score: 0.5048076923076926	 Topic: 0.176*"development" + 0.152*"housing" + 0.059*"economic" + 0.049*"community" + 0.019*"infrastructure" + 0.018*"managed" + 0.018*"medicaid" + 0.014*"department" + 0.014*"buildings" + 0.012*"economy" + 0.012*"assistance" + 0.011*"program" + 0.010*"family" + 0.008*"benefits" + 0.008*"addressing" 

Score: 0.25480769230769246	 Topic: 0.048*"death" + 0.042*"maryland" + 0.029*"database" + 0.027*"deaths" + 0.027*"office" + 0.024*"department" + 0.021*"new" + 0.020*"cdc" + 0.019*"metadata" + 0.018*"citizens" + 0.017*"among" + 0.017*"vital" + 0.015*"state" + 0.014*"infants" + 0.014*"national" 



Solid match to a housing and community assistance topic.

# Now let's calculate the popularity of each topic across all datasets
- We will need some way to give a topic "credit" for the views and downloads of a dataset that it matches.

## "Winner take all" popularity metric:**
- **Scoring Rules:**
    - Only the topic that composes the largest share of a document scores "points" for its "Adjusted Popularity" total.
    - If a topic composes the largest share of that document, its "points" are its composition score times that dataset's combined log views and downloads.
        - We're using log views so as not to overweigh datasets that just have alot of eyeballs on them (like ones from NYC). 
    - So, in our example above, the "development, housing, economic, etc..." dataset would score .504 times that dataset's (log views + downloads)
    - Remember, each "document" is just a list of words from one of 28000 datasets' metadata.

First **let's remove healthcare.gov data from our dataset**; it inadvertently crept in there.

In [24]:
topics = stats[stats.domain != 'data.healthcare.gov']

In [25]:
def winner_take_all_pop(df):
    results_dict = {}
    df = df.fillna(0) #have to fill NaNs or you'll get wonky results

    for row_num in df.index:
        tup_list = df.topic_comp[row_num] #list of (topic, doc composition) tuples
        
        #neat little trick to return only the tuple w/highest index[1] value
        winner_tuple = max(tup_list, key=lambda item:item[1])  
            
        if not winner_tuple[0] in results_dict: #if not in dict, add it with its score
            #must use .loc, treating row_num like a label and not an integer
            results_dict[winner_tuple[0]] = (winner_tuple[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total_log))
        else:
            pass
        
        if winner_tuple[0] in results_dict: #if in dict, increment that key's value with score
            results_dict[winner_tuple[0]] += (winner_tuple[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total_log))
    
    return results_dict

In [26]:
winners_dict = winner_take_all_pop(topics)
#winners_dict #can uncomment just to double check dict numbers align with df topic number

Now let's make a quick and dirty function to replace these topic numbers with names in a sorted df for easy analysis:

In [27]:
def dict_to_df(d):
    df = pd.DataFrame.from_dict(d, orient='index')
    df = df.rename(index=str, columns={0:"Adjusted_Popularity"})
    df = df.reset_index() #DON'T drop; our index carries real value as it is our topic numbers
    df = df.rename(columns={'index': 'Topic Number'}) #when you reset_index with drop=False, you get orig index as col
    df['Topic Number'] = pd.to_numeric(df['Topic Number'])

    df = df.sort_values(by='Topic Number', ascending=True) 

    words_in_topics = [tup[1] for tup in lda_52_sixty.show_topics(num_topics=52, formatted=False)]
    df['Topic'] = words_in_topics

    df[["topic1", "topic2", "topic3", "topic4", 
           "topic5", "topic6", "topic7", "topic8", "topic9", "topic10"]] = df.Topic.apply(pd.Series)

    df = df.sort_values(by='Adjusted_Popularity', ascending=False)
    df = df.reset_index(drop=True) #this time we want to get rid of old index, which means nothing post sorting
    
    return df

In [28]:
winners_df = dict_to_df(winners_dict)
winners_df.head(15)

Unnamed: 0,Topic Number,Adjusted_Popularity,Topic,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
0,28,16465623.054,"[(transportation, 0.0493422015093), (plans, 0....","(transportation, 0.0493422015093)","(plans, 0.0463408207807)","(iowa, 0.0372400214617)","(area, 0.031583902009)","(transit, 0.0297466527633)","(operations, 0.029650998681)","(region, 0.0286308933708)","(bus, 0.0239404049245)","(priority, 0.0197652573751)","(people, 0.0157596265395)"
1,41,526719.44,"[(service, 0.0872627144419), (requests, 0.0350...","(service, 0.0872627144419)","(requests, 0.0350316945429)","(inspection, 0.0326789351929)","(inspections, 0.0313503518264)","(311, 0.0263119431044)","(public, 0.025968462717)","(violations, 0.0237687923195)","(request, 0.022587674832)","(complaints, 0.0193704153109)","(days, 0.0180026003213)"
2,23,494236.068,"[(police, 0.0495996521003), (user, 0.032265367...","(police, 0.0495996521003)","(user, 0.0322653671005)","(information, 0.02962646057)","(page, 0.0244358789937)","(department, 0.0237288839327)","(use, 0.023314363269)","(orleans, 0.0183127709217)","(injury, 0.0174977535732)","(may, 0.0166580734835)","(records, 0.0131997374277)"
3,3,391551.864,"[(transportation, 0.164542015423), (traffic, 0...","(transportation, 0.164542015423)","(traffic, 0.0382494150922)","(street, 0.0297088546702)","(parking, 0.0260484479646)","(infrastructure, 0.024499661836)","(city, 0.0242484880067)","(safe, 0.0239084069945)","(vehicle, 0.0231351100665)","(streets, 0.0219852292388)","(bike, 0.0155102577616)"
4,26,230235.304,"[(public, 0.168381768411), (safety, 0.13998228...","(public, 0.168381768411)","(safety, 0.139982283853)","(crime, 0.0631808672948)","(police, 0.03597164994)","(department, 0.0193424493246)","(illinois, 0.0152118075546)","(race, 0.0133652012852)","(reported, 0.011776909211)","(crimes, 0.0115504690759)","(criminal, 0.0105105785255)"
5,43,220462.909,"[(state, 0.175638296746), (new, 0.133403517998...","(state, 0.175638296746)","(new, 0.133403517998)","(michigan, 0.10266572561)","(york, 0.0777270773871)","(information, 0.0292838745522)","(check, 0.0203220904165)","(measurements, 0.0197080431536)","(jobs, 0.0170402721275)","(licensing, 0.0169768940674)","(ny, 0.0167776417369)"
6,18,189346.054,"[(column, 0.0634273088776), (update, 0.0466752...","(column, 0.0634273088776)","(update, 0.0466752365795)","(homeless, 0.0416639506669)","(annually, 0.0300723765516)","(frequency, 0.0292689424963)","(position, 0.021127522302)","(city, 0.019841539727)","(commercial, 0.0184651409887)","(daily, 0.0168690536946)","(salaries, 0.0155892433715)"
7,48,188011.679,"[(business, 0.0803271304597), (license, 0.0323...","(business, 0.0803271304597)","(license, 0.032302648157)","(job, 0.0220026249885)","(contracts, 0.0219341654552)","(businesses, 0.0197708503312)","(vendor, 0.0178329754117)","(owned, 0.0175281633585)","(contract, 0.0175169967387)","(licenses, 0.0169698891758)","(certified, 0.0166884614561)"
8,24,175960.006,"[(official, 0.0610942909105), (account, 0.0497...","(official, 0.0610942909105)","(account, 0.0497660475395)","(accounts, 0.0483014586496)","((openmichigan@michigan, 0.0408207146959)","(special, 0.0291639985893)","(nursing, 0.028770388416)","(software, 0.0268065464462)","(gis, 0.023974220834)","(use, 0.0225286064837)","(consumer, 0.0224473950606)"
9,29,172425.614,"[(planning, 0.0609328856062), (county, 0.05924...","(planning, 0.0609328856062)","(county, 0.0592460462606)","(district, 0.0590822346049)","(districts, 0.0460914886485)","(boundaries, 0.0372114905902)","(city, 0.029524721059)","(areas, 0.0270950065527)","(zoning, 0.0266346423368)","(gis, 0.0244396763289)","(council, 0.0230826924938)"


### Results:
- "Transportation Planning and Operations" predominates. Another transportation dataset -- something that I would call "Personal Transportation" -- parking, traffic, vehicle, and even bike -- is also the fourth most popular by this metric. Between these two topics, it's safe to conclude transportation datasets are very popular.
- Likewise with a "Police Information and Records" and "Crime Reports" topics, which place 3 and 5 respectively.
- The metrics for rankings 2, 3, and 4 are very close, and could easily change based on how we tweak our metrics. Let's see.

## "Winner Take All with Thresholds" Rules:
- **Scoring Rules:**
    - Same as "Winner Take All", except a winning topic must compose at least a certain threshold of a document to get any points.
    - We'll try 0.2 (low) and 0.5 (high) thresholds.

In [29]:
def winner_take_all_thresholds(df, thresh):
    results_dict = {}
    df = df.fillna(0) #make sure we have a clean df
    
    if thresh == "low":
        for row_num in df.index:
            tup_list = df.topic_comp[row_num] #list of (topic, doc composition) tuples
        
            #neat little trick to return only the tuple w/highest index[1] value
            winner_tuple = max(tup_list, key=lambda item:item[1])  
            
            if not winner_tuple[0] in results_dict: #if not in dict, add it with its score
                if winner_tuple[1] > 0.2:
                    results_dict[winner_tuple[0]] = (winner_tuple[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total_log))
            else:
                pass
        
            if winner_tuple[0] in results_dict: #if in dict, increment that key's value with score
                if winner_tuple[1] > 0.2:
                    results_dict[winner_tuple[0]] += (winner_tuple[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total_log))
            else:
                pass
                
    if thresh == "high":
        for row_num in df.index:
            tup_list = df.topic_comp[row_num] #list of (topic, doc composition) tuples
        
            #neat little trick to return only the tuple w/highest index[1] value
            winner_tuple = max(tup_list, key=lambda item:item[1])  
            
            if not winner_tuple[0] in results_dict: #if not in dict, add it with its score
                if winner_tuple[1] > 0.5:
                    results_dict[winner_tuple[0]] = (winner_tuple[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total_log))
            else:
                pass
        
            if winner_tuple[0] in results_dict: #if in dict, increment that key's value with score
                if winner_tuple[1] > 0.5:
                    results_dict[winner_tuple[0]] += (winner_tuple[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total_log))
            else:
                pass

    return results_dict

In [30]:
low = winner_take_all_thresholds(topics, thresh='low')

In [31]:
low_thresh_ranks = dict_to_df(low)

In [32]:
low_thresh_ranks.head(10)

Unnamed: 0,Topic Number,Adjusted_Popularity,Topic,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
0,28,16461078.655,"[(transportation, 0.0493422015093), (plans, 0....","(transportation, 0.0493422015093)","(plans, 0.0463408207807)","(iowa, 0.0372400214617)","(area, 0.031583902009)","(transit, 0.0297466527633)","(operations, 0.029650998681)","(region, 0.0286308933708)","(bus, 0.0239404049245)","(priority, 0.0197652573751)","(people, 0.0157596265395)"
1,41,518910.864,"[(service, 0.0872627144419), (requests, 0.0350...","(service, 0.0872627144419)","(requests, 0.0350316945429)","(inspection, 0.0326789351929)","(inspections, 0.0313503518264)","(311, 0.0263119431044)","(public, 0.025968462717)","(violations, 0.0237687923195)","(request, 0.022587674832)","(complaints, 0.0193704153109)","(days, 0.0180026003213)"
2,23,493161.782,"[(police, 0.0495996521003), (user, 0.032265367...","(police, 0.0495996521003)","(user, 0.0322653671005)","(information, 0.02962646057)","(page, 0.0244358789937)","(department, 0.0237288839327)","(use, 0.023314363269)","(orleans, 0.0183127709217)","(injury, 0.0174977535732)","(may, 0.0166580734835)","(records, 0.0131997374277)"
3,3,325695.778,"[(transportation, 0.164542015423), (traffic, 0...","(transportation, 0.164542015423)","(traffic, 0.0382494150922)","(street, 0.0297088546702)","(parking, 0.0260484479646)","(infrastructure, 0.024499661836)","(city, 0.0242484880067)","(safe, 0.0239084069945)","(vehicle, 0.0231351100665)","(streets, 0.0219852292388)","(bike, 0.0155102577616)"
4,26,219560.73,"[(public, 0.168381768411), (safety, 0.13998228...","(public, 0.168381768411)","(safety, 0.139982283853)","(crime, 0.0631808672948)","(police, 0.03597164994)","(department, 0.0193424493246)","(illinois, 0.0152118075546)","(race, 0.0133652012852)","(reported, 0.011776909211)","(crimes, 0.0115504690759)","(criminal, 0.0105105785255)"
5,43,208198.844,"[(state, 0.175638296746), (new, 0.133403517998...","(state, 0.175638296746)","(new, 0.133403517998)","(michigan, 0.10266572561)","(york, 0.0777270773871)","(information, 0.0292838745522)","(check, 0.0203220904165)","(measurements, 0.0197080431536)","(jobs, 0.0170402721275)","(licensing, 0.0169768940674)","(ny, 0.0167776417369)"
6,18,188072.809,"[(column, 0.0634273088776), (update, 0.0466752...","(column, 0.0634273088776)","(update, 0.0466752365795)","(homeless, 0.0416639506669)","(annually, 0.0300723765516)","(frequency, 0.0292689424963)","(position, 0.021127522302)","(city, 0.019841539727)","(commercial, 0.0184651409887)","(daily, 0.0168690536946)","(salaries, 0.0155892433715)"
7,48,181698.149,"[(business, 0.0803271304597), (license, 0.0323...","(business, 0.0803271304597)","(license, 0.032302648157)","(job, 0.0220026249885)","(contracts, 0.0219341654552)","(businesses, 0.0197708503312)","(vendor, 0.0178329754117)","(owned, 0.0175281633585)","(contract, 0.0175169967387)","(licenses, 0.0169698891758)","(certified, 0.0166884614561)"
8,24,172835.501,"[(official, 0.0610942909105), (account, 0.0497...","(official, 0.0610942909105)","(account, 0.0497660475395)","(accounts, 0.0483014586496)","((openmichigan@michigan, 0.0408207146959)","(special, 0.0291639985893)","(nursing, 0.028770388416)","(software, 0.0268065464462)","(gis, 0.023974220834)","(use, 0.0225286064837)","(consumer, 0.0224473950606)"
9,29,170112.755,"[(planning, 0.0609328856062), (county, 0.05924...","(planning, 0.0609328856062)","(county, 0.0592460462606)","(district, 0.0590822346049)","(districts, 0.0460914886485)","(boundaries, 0.0372114905902)","(city, 0.029524721059)","(areas, 0.0270950065527)","(zoning, 0.0266346423368)","(gis, 0.0244396763289)","(council, 0.0230826924938)"


### Interesting. The calculated scores haven't changed that much, but they have moved by a few thousand in many cases. 

We can see from this that how you score matters. Let's see how much by setting a very high affinity threshold:

In [33]:
high = winner_take_all_thresholds(topics, thresh='high')

In [34]:
high_thresh_ranks = dict_to_df(high)

In [35]:
high_thresh_ranks.head(10)

Unnamed: 0,Topic Number,Adjusted_Popularity,Topic,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
0,28,15956210.504,"[(transportation, 0.0493422015093), (plans, 0....","(transportation, 0.0493422015093)","(plans, 0.0463408207807)","(iowa, 0.0372400214617)","(area, 0.031583902009)","(transit, 0.0297466527633)","(operations, 0.029650998681)","(region, 0.0286308933708)","(bus, 0.0239404049245)","(priority, 0.0197652573751)","(people, 0.0157596265395)"
1,23,449272.292,"[(police, 0.0495996521003), (user, 0.032265367...","(police, 0.0495996521003)","(user, 0.0322653671005)","(information, 0.02962646057)","(page, 0.0244358789937)","(department, 0.0237288839327)","(use, 0.023314363269)","(orleans, 0.0183127709217)","(injury, 0.0174977535732)","(may, 0.0166580734835)","(records, 0.0131997374277)"
2,41,275563.717,"[(service, 0.0872627144419), (requests, 0.0350...","(service, 0.0872627144419)","(requests, 0.0350316945429)","(inspection, 0.0326789351929)","(inspections, 0.0313503518264)","(311, 0.0263119431044)","(public, 0.025968462717)","(violations, 0.0237687923195)","(request, 0.022587674832)","(complaints, 0.0193704153109)","(days, 0.0180026003213)"
3,24,136515.744,"[(official, 0.0610942909105), (account, 0.0497...","(official, 0.0610942909105)","(account, 0.0497660475395)","(accounts, 0.0483014586496)","((openmichigan@michigan, 0.0408207146959)","(special, 0.0291639985893)","(nursing, 0.028770388416)","(software, 0.0268065464462)","(gis, 0.023974220834)","(use, 0.0225286064837)","(consumer, 0.0224473950606)"
4,43,102284.078,"[(state, 0.175638296746), (new, 0.133403517998...","(state, 0.175638296746)","(new, 0.133403517998)","(michigan, 0.10266572561)","(york, 0.0777270773871)","(information, 0.0292838745522)","(check, 0.0203220904165)","(measurements, 0.0197080431536)","(jobs, 0.0170402721275)","(licensing, 0.0169768940674)","(ny, 0.0167776417369)"
5,31,92203.533,"[(politics, 0.110698260148), (government, 0.09...","(politics, 0.110698260148)","(government, 0.0983150505736)","(election, 0.0382597777507)","(campaign, 0.0309485620342)","(elections, 0.0296823051034)","(commission, 0.0293149928715)","(results, 0.0249283166716)","(city, 0.0213278569983)","(460, 0.0200805489271)","(finance, 0.0197352160164)"
6,46,70784.204,"[(address, 0.0425142976586), (name, 0.02889526...","(address, 0.0425142976586)","(name, 0.028895265436)","(number, 0.0260959266214)","(facility, 0.0211950547561)","(patient, 0.0187283290738)","(location, 0.0173565095563)","(contains, 0.0147027345652)","(type, 0.0140782421165)","(g, 0.013900974383)","(transfer, 0.0138993318674)"
7,26,63222.136,"[(public, 0.168381768411), (safety, 0.13998228...","(public, 0.168381768411)","(safety, 0.139982283853)","(crime, 0.0631808672948)","(police, 0.03597164994)","(department, 0.0193424493246)","(illinois, 0.0152118075546)","(race, 0.0133652012852)","(reported, 0.011776909211)","(crimes, 0.0115504690759)","(criminal, 0.0105105785255)"
8,32,56775.321,"[(open, 0.111361606245), (created, 0.066633337...","(open, 0.111361606245)","(created, 0.0666333370863)","(items, 0.0469751568036)","(clear, 0.0249716560821)","(volume, 0.0175690284296)","(please, 0.0162081895116)","(presented, 0.015509271243)","(difficult, 0.0141524865491)","(potential, 0.013902418106)","(note, 0.0129587647759)"
9,50,56295.097,"[(program, 0.032511463571), (employees, 0.0289...","(program, 0.032511463571)","(employees, 0.0289751722757)","(adult, 0.0244085054991)","(employee, 0.0242968020528)","(annual, 0.0204986126128)","(names, 0.0182179413533)","(full, 0.0179091183965)","(abuse, 0.0162096834129)","(nutrition, 0.0157677124596)","(beginning, 0.0142137708217)"


**This is very interesting. Our main transportation tag still reigns, but the other transportation tag has fallen completely out of the top 10! Our 311 topic tag still hangs in there, but falls to third versus police records. We can draw a few conclusions from this:**
- Transportation, 311/service calls, and police records are resilient; they remain popular across calculations.
- Other tags shift rankings significantly. We should be careful about drawing sweeping conclusions from ranks 4-10. 
    - To begin with, we've sacrificed interpretability to reduce variance in our adjusted popularity calculation. We don't want total views to make NYC datasets the "most popular". But by summing log views and downloads, and taking a portion of those views, we've lost some ability to interpret absolute differences in our calculations. A human knows the difference between 1000 views and 10 views; what's the difference between 100,095 adjusted popularity and 92,203? It's not intuitive.
    - These results don't mean "release this open dataset 3rd, then this 4th, etc. It's just a guide to a snapshot of Socrata data at a certain point in time. **As noted, however, it is safe to conclude that transportation plans and operations, along with police records and 311 data, are datasets the public accesses frequently.**
- Our "catch all" tags seem to get more popular with a stricter cut off. I think it's possible that they match parochial datasets very well, but match fewer datasets as a winner overall. Topics that actually tell us something but are more broad may match datasets less well, but match more (at a 0.3-0.4 threshold), therefore losing some credit with a stricter threshold. This is just speculation, however.

### Let's a method of calculating wherein a dataset can get "partial credit", as long as it's above a certain threshold.
- Datasets get their share of combined log views + downloads, as long as they compose at least 0.2 of a document.

In [36]:
def calc_proportional_pop(df):
    df = df.fillna(0)
    results_dict = {}
    
    for row_num in df.index:
        for tup in df.topic_comp[row_num]: #no need to pick a winner here...
            if not tup[0] in results_dict:
                if tup[1] >= 0.2: #.loc interprets as the label of the index, not int position
                    results_dict[tup[0]] = (tup[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total_log))
                else:
                    pass
            if tup[0] in results_dict:
                if tup[1] >= 0.2:
                    results_dict[tup[0]] += (tup[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total_log))
    return results_dict

In [37]:
proportions = calc_proportional_pop(topics) #pretty sure .loc makes this slower than using .iloc

In [38]:
proportional_pop_df = dict_to_df(proportions)

In [39]:
proportional_pop_df.head(15)

Unnamed: 0,Topic Number,Adjusted_Popularity,Topic,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
0,28,16478738.222,"[(transportation, 0.0493422015093), (plans, 0....","(transportation, 0.0493422015093)","(plans, 0.0463408207807)","(iowa, 0.0372400214617)","(area, 0.031583902009)","(transit, 0.0297466527633)","(operations, 0.029650998681)","(region, 0.0286308933708)","(bus, 0.0239404049245)","(priority, 0.0197652573751)","(people, 0.0157596265395)"
1,41,586779.754,"[(service, 0.0872627144419), (requests, 0.0350...","(service, 0.0872627144419)","(requests, 0.0350316945429)","(inspection, 0.0326789351929)","(inspections, 0.0313503518264)","(311, 0.0263119431044)","(public, 0.025968462717)","(violations, 0.0237687923195)","(request, 0.022587674832)","(complaints, 0.0193704153109)","(days, 0.0180026003213)"
2,23,502600.191,"[(police, 0.0495996521003), (user, 0.032265367...","(police, 0.0495996521003)","(user, 0.0322653671005)","(information, 0.02962646057)","(page, 0.0244358789937)","(department, 0.0237288839327)","(use, 0.023314363269)","(orleans, 0.0183127709217)","(injury, 0.0174977535732)","(may, 0.0166580734835)","(records, 0.0131997374277)"
3,3,359881.938,"[(transportation, 0.164542015423), (traffic, 0...","(transportation, 0.164542015423)","(traffic, 0.0382494150922)","(street, 0.0297088546702)","(parking, 0.0260484479646)","(infrastructure, 0.024499661836)","(city, 0.0242484880067)","(safe, 0.0239084069945)","(vehicle, 0.0231351100665)","(streets, 0.0219852292388)","(bike, 0.0155102577616)"
4,31,251066.127,"[(politics, 0.110698260148), (government, 0.09...","(politics, 0.110698260148)","(government, 0.0983150505736)","(election, 0.0382597777507)","(campaign, 0.0309485620342)","(elections, 0.0296823051034)","(commission, 0.0293149928715)","(results, 0.0249283166716)","(city, 0.0213278569983)","(460, 0.0200805489271)","(finance, 0.0197352160164)"
5,26,249091.564,"[(public, 0.168381768411), (safety, 0.13998228...","(public, 0.168381768411)","(safety, 0.139982283853)","(crime, 0.0631808672948)","(police, 0.03597164994)","(department, 0.0193424493246)","(illinois, 0.0152118075546)","(race, 0.0133652012852)","(reported, 0.011776909211)","(crimes, 0.0115504690759)","(criminal, 0.0105105785255)"
6,43,241684.562,"[(state, 0.175638296746), (new, 0.133403517998...","(state, 0.175638296746)","(new, 0.133403517998)","(michigan, 0.10266572561)","(york, 0.0777270773871)","(information, 0.0292838745522)","(check, 0.0203220904165)","(measurements, 0.0197080431536)","(jobs, 0.0170402721275)","(licensing, 0.0169768940674)","(ny, 0.0167776417369)"
7,29,233415.749,"[(planning, 0.0609328856062), (county, 0.05924...","(planning, 0.0609328856062)","(county, 0.0592460462606)","(district, 0.0590822346049)","(districts, 0.0460914886485)","(boundaries, 0.0372114905902)","(city, 0.029524721059)","(areas, 0.0270950065527)","(zoning, 0.0266346423368)","(gis, 0.0244396763289)","(council, 0.0230826924938)"
8,24,222915.66,"[(official, 0.0610942909105), (account, 0.0497...","(official, 0.0610942909105)","(account, 0.0497660475395)","(accounts, 0.0483014586496)","((openmichigan@michigan, 0.0408207146959)","(special, 0.0291639985893)","(nursing, 0.028770388416)","(software, 0.0268065464462)","(gis, 0.023974220834)","(use, 0.0225286064837)","(consumer, 0.0224473950606)"
9,48,204355.994,"[(business, 0.0803271304597), (license, 0.0323...","(business, 0.0803271304597)","(license, 0.032302648157)","(job, 0.0220026249885)","(contracts, 0.0219341654552)","(businesses, 0.0197708503312)","(vendor, 0.0178329754117)","(owned, 0.0175281633585)","(contract, 0.0175169967387)","(licenses, 0.0169698891758)","(certified, 0.0166884614561)"


### These are intuitive results. Our top 6 topics are clean and interpretable, and pass a gut-check in terms of what citizens might search and view.
- However, there is really not much separating topics 5-10 (again, small differences in magnitude in our adjusted popularity metric don't tell us much).
- Transportation, police records, and 311 are still atop the heap. **However, our second transportation category comes all the way back to #4;** I'd bet that this is because it's often second place to the more popular transportation topic in the winner-take-all format.
- I would argue this is the best format for scoring; **some datasets really do, to the human eye, fit into two (or three or four!) topic tags**. This also dilutes model error in which the "winner" doesn't really match the dataset according to the human eye; a dataset that does can still get partial credit.
- However, you can make an argument for only winner topics of high affinity. This type of probabilistic analysis is always going to have assumptions - it's just important to note them.

# Top Topic Tags by Domain
- Let's see if the most popular topics vary by city/state. We can use "domain" -- the web address of an open data portal -- to sort them.

In [40]:
def popularity_by_domain(df, domain_name):
    
    domain_df = df[df.domain == domain_name]
    popularity_dict = calc_proportional_pop(domain_df) #we're still using our proportional scoring here
    
    df_pop = pd.DataFrame.from_dict(popularity_dict, orient='index') #dict is in df but index numbers (aka topic numbers) are random
    df_pop = df_pop.rename(index=int, columns={0:"Adjusted_Popularity"})
    
    topic_words_list = []
    for topic_num in df_pop.index: #this iterates in the right order - index numbers have real meaning
        topic_tups = lda_52_sixty.show_topic(topic_num)
        raw_words = list(i[0] for i in lda_52_sixty.show_topic(topic_num))
        string = ", ".join(raw_words)
        topic_words_list.append(string) #plug in names by index numbers
    
    df_pop['Topic_Words'] = topic_words_list
    
    df_pop = df_pop.sort_values(by='Adjusted_Popularity', ascending=False)
    
    #put the domain in each row - kind of an ugly way to preserve info about domain but it's what we've got
    domain_string = domain_name
    df_pop['Domain'] = [domain_string for number in range(len(df_pop.index))]
    
    return df_pop

In [41]:
madison = popularity_by_domain(topics, "data.cityofmadison.com")

In [42]:
madison

Unnamed: 0,Adjusted_Popularity,Topic_Words,Domain
31,3925.956,"politics, government, election, campaign, elec...",data.cityofmadison.com
21,3087.734,"public, safety, fire, month, calls, police, em...",data.cityofmadison.com
26,2661.438,"public, safety, crime, police, department, ill...",data.cityofmadison.com
49,2649.683,"development, housing, economic, community, inf...",data.cityofmadison.com
0,1713.837,"prevention, hospital, ny, healthy, statewide, ...",data.cityofmadison.com
12,1133.078,"map, ""about"", shows, details, tracking, click,...",data.cityofmadison.com
42,832.791,"information, list, code, required, political, ...",data.cityofmadison.com
1,661.705,"chart, children, income, historic, home, assis...",data.cityofmadison.com
9,559.38,"recreation, parks, jersey, park, center, neigh...",data.cityofmadison.com
41,391.758,"service, requests, inspection, inspections, 31...",data.cityofmadison.com


Interesting! By our log measures, politics is the most popular topic in Madison. Makes sense.

### Finally, by this scoring, let's return all topics that formed a top topic and the city's they were top in:

In [43]:
def top_topics_by_domain(df, domains_series):
    
    domains_dict = {}
    for domain in domains_series:
        
        domain_df = popularity_by_domain(df, domain) #this is a df like the Madison example
        Topic = domain_df.iloc[0].Topic_Words #string
        Domain = domain_df.iloc[0].Domain.split() #have to remember to split(), which really just gives us one word
        
        if Topic in domains_dict: 
            domains_dict[Topic] += Domain
        else:
            domains_dict[Topic] = list(Domain)
    
    return domains_dict

(This takes a bit to run since we are slicing df's for 140+ domains and iterating through them. There's probably a better way.)

In [44]:
all_domains = topics.domain.unique()
top_topic_domains = top_topics_by_domain(topics, domains_series=all_domains)

In [45]:
top_topic_domains

{'budget, finance, fund, year, operating, city, fiscal, funds, calendar, communities': ['data.smcgov.org',
  'data.tompsc.com',
  'data.topeka.org',
  'data.vbgov.com',
  'information.stpaul.gov',
  'performance.ci.janesville.wi.us'],
 'business, license, job, contracts, businesses, vendor, owned, contract, licenses, certified': ['data.auburnwa.gov',
  'data.cityofgainesville.org',
  'data.culvercity.org',
  'data.hampton.gov',
  'data.oregon.gov',
  'data.oxnard.org',
  'opendata.cityofhenderson.com',
  'opendata.lasvegasnevada.gov'],
 'care, finance, provider, child, cost, plan, day, bay, government, fee': ['data.detroitmi.gov'],
 'census, indicators, survey, includes, indicator, community, demographics, bureau, u, level': ['data.livewellsd.org'],
 'chart, children, income, historic, home, assistance, homes, families, low, pay': ['data.richmondgov.com'],
 'column, update, homeless, annually, frequency, position, city, commercial, daily, salaries': ['data.cityofwestsacramento.org',
  

Now we have some additional insights into our top rankings. **Notice that our top topic, "transportation planning and operations" is only the most popular dataset in Texas and a California transit agency (as well as the smaller City of Grand Prairie).** This makes sense -- people in big populated places probably care alot about transportation projects. 

This topic is also definitely getting some weight from the fact that Texas and California have high populations and presumably high open data access rates. However, this is why we took the log of views; to dilute the effects of high population.

### Still, this has us thinking -- let's see if we get different rankings if we sum a topics popularity within *each* city/state, and then log this sum.
- This will further reduce the weight of city/state open data portals that just get a high number of views and downloads (presumably) by virtue of being a high-population area.

## One more scoring setting: Normalize Metrics within a City/State 
## (Dampened Popularity):
- Each city/state/organization (as identified by domain) has topics that "win" for each dataset.
- Add up each's topics "spoils" (total views and downloads) within a given city. THEN take the log of that.
- Then add up all log totals for each topic across cities.
- This is an extra control for big cities with lots of users skewing our metrics. **We'll call it dampened popularity.**

In [46]:
from collections import Counter #need this to sync dictionaries

In [47]:
def popularity_normalized_by_area(df, domains_series):
    
    df = df.fillna(0) #there are 369 datasets with NaN download counts (no missing view counts)
    #let's assume they're NaN because there were no downloads
    
    list_of_domain_dicts = []
    popularity_dict = {}
    
    for domain in domains_series:
        
        results_dict = {}
        
        #get our df only of rows from a given city/state domain
        domain_df = df[df.domain == domain]
        
        for row_num in domain_df.index:
            tup_list = domain_df.topic_comp[row_num] #list of (topic, doc composition) tuples
        
            #neat little trick to return only the tuple w/highest index[1] value
            winner_tuple = max(tup_list, key=lambda item:item[1])  
            
            if not winner_tuple[0] in results_dict: #if not in dict, add it with its TOTAL VIEWS score
                if winner_tuple[1] > 0.2:
                    results_dict[winner_tuple[0]] = (winner_tuple[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total))
            else:
                pass
        
            if winner_tuple[0] in results_dict: #if in dict, increment that key's value with score
                if winner_tuple[1] > 0.2:
                    results_dict[winner_tuple[0]] += (winner_tuple[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total))
            else:
                pass
            
        
        #when loop of domain_df is finished, take log of all keys in dict
        log_dict = {}
        for k,v in results_dict.items():
            if v != 0: #have to do this since you can't take a log of 0; some "winners" get datasets with 0 views
                log_dict[k] = np.log(v)
        
        #now we have a polished dict of topic numbers as keys and log of all views/DLs as values; append it to list
        list_of_domain_dicts.append(log_dict)
    
    #use Counter() object to sync our dictionaries
    c = Counter()
    for d in list_of_domain_dicts:
        c.update(d)
    
    popularity_dict = dict(c)
    
    return popularity_dict

In [48]:
all_domains = topics.domain.unique()
pop_norm_area_dict = popularity_normalized_by_area(topics, all_domains)

In [49]:
#pop_norm_area_dict #uncomment just to ensure topic numbers align with topic tags

In [50]:
#lda_52_sixty.show_topics(num_topics=52, formatted=False)

In [51]:
norm_by_areas_df = dict_to_df(pop_norm_area_dict)
norm_by_areas_df.head(20)

Unnamed: 0,Topic Number,Adjusted_Popularity,Topic,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
0,26,799.492,"[(public, 0.168381768411), (safety, 0.13998228...","(public, 0.168381768411)","(safety, 0.139982283853)","(crime, 0.0631808672948)","(police, 0.03597164994)","(department, 0.0193424493246)","(illinois, 0.0152118075546)","(race, 0.0133652012852)","(reported, 0.011776909211)","(crimes, 0.0115504690759)","(criminal, 0.0105105785255)"
1,3,692.631,"[(transportation, 0.164542015423), (traffic, 0...","(transportation, 0.164542015423)","(traffic, 0.0382494150922)","(street, 0.0297088546702)","(parking, 0.0260484479646)","(infrastructure, 0.024499661836)","(city, 0.0242484880067)","(safe, 0.0239084069945)","(vehicle, 0.0231351100665)","(streets, 0.0219852292388)","(bike, 0.0155102577616)"
2,21,670.051,"[(public, 0.0713314310777), (safety, 0.0690276...","(public, 0.0713314310777)","(safety, 0.0690276616473)","(fire, 0.0477503790148)","(month, 0.0205819878206)","(calls, 0.0173756495075)","(police, 0.0170781244666)","(emergency, 0.0165221049163)","(response, 0.0157891037516)","(discharge, 0.0150688301524)","(incident, 0.0140341527688)"
3,49,623.113,"[(development, 0.176027756751), (housing, 0.15...","(development, 0.176027756751)","(housing, 0.152212657649)","(economic, 0.0588541094432)","(community, 0.0485919430535)","(infrastructure, 0.0190401252932)","(managed, 0.0183046960354)","(medicaid, 0.0178764063515)","(department, 0.0140238290886)","(buildings, 0.0137754059922)","(economy, 0.0123473943134)"
4,37,584.586,"[(financial, 0.0959611006916), (permits, 0.066...","(financial, 0.0959611006916)","(permits, 0.0668632739602)","(building, 0.0631953900546)","(permit, 0.054456521697)","(expenditures, 0.0429247389481)","(guide, 0.0250580534259)","(issued, 0.0232434329642)","(construction, 0.0228846174877)","(information, 0.0227968624179)","(filter, 0.021954851646)"
5,40,582.037,"[(finance, 0.164742098166), (year, 0.062366577...","(finance, 0.164742098166)","(year, 0.062366577351)","(state, 0.0533822539548)","(monthly, 0.0506962481995)","(fiscal, 0.0476222221357)","(payments, 0.0357199197181)","(government, 0.0264540232587)","(june, 0.0225770287471)","(july, 0.0216759115561)","(report, 0.0215900104843)"
6,31,572.482,"[(politics, 0.110698260148), (government, 0.09...","(politics, 0.110698260148)","(government, 0.0983150505736)","(election, 0.0382597777507)","(campaign, 0.0309485620342)","(elections, 0.0296823051034)","(commission, 0.0293149928715)","(results, 0.0249283166716)","(city, 0.0213278569983)","(460, 0.0200805489271)","(finance, 0.0197352160164)"
7,48,560.589,"[(business, 0.0803271304597), (license, 0.0323...","(business, 0.0803271304597)","(license, 0.032302648157)","(job, 0.0220026249885)","(contracts, 0.0219341654552)","(businesses, 0.0197708503312)","(vendor, 0.0178329754117)","(owned, 0.0175281633585)","(contract, 0.0175169967387)","(licenses, 0.0169698891758)","(certified, 0.0166884614561)"
8,41,499.286,"[(service, 0.0872627144419), (requests, 0.0350...","(service, 0.0872627144419)","(requests, 0.0350316945429)","(inspection, 0.0326789351929)","(inspections, 0.0313503518264)","(311, 0.0263119431044)","(public, 0.025968462717)","(violations, 0.0237687923195)","(request, 0.022587674832)","(complaints, 0.0193704153109)","(days, 0.0180026003213)"
9,34,496.677,"[(education, 0.181504757587), (school, 0.08185...","(education, 0.181504757587)","(school, 0.0818576925692)","(schools, 0.0317766682176)","(students, 0.0295716293995)","(public, 0.0171734235818)","(governance, 0.0167833173256)","(state, 0.0165785793709)","(year, 0.0161596047936)","(student, 0.0161397342546)","(district, 0.0160036278832)"


In [52]:
#a little cleaner view for display online:
def display_view(df):
    
    pd.options.display.max_colwidth = 120 #so we can see our whole topic
    
    clean_topics = []
    for topic_tup in df.Topic: 
        raw_words = list(i[0] for i in topic_tup)
        string = ", ".join(raw_words)
        clean_topics.append(string)
    
    clean_df = df.iloc[:, 0:2] #take off the old topic cols
    clean_df['Topic'] = clean_topics #plug in clean topic strings list to new df
    
    clean_df.columns = ['Topic ID', 'Adjusted Popularity Score', 'Topic Content'] #rename for web
    
    clean_df.index = np.arange(1, len(clean_df.index) + 1) #we'll start index at 1 so it's easier to view ranks
    
    return clean_df

In [53]:
display_view = display_view(norm_by_areas_df)
display_view

Unnamed: 0,Topic ID,Adjusted Popularity Score,Topic Content
1,26,799.492,"public, safety, crime, police, department, illinois, race, reported, crimes, criminal"
2,3,692.631,"transportation, traffic, street, parking, infrastructure, city, safe, vehicle, streets, bike"
3,21,670.051,"public, safety, fire, month, calls, police, emergency, response, discharge, incident"
4,49,623.113,"development, housing, economic, community, infrastructure, managed, medicaid, department, buildings, economy"
5,37,584.586,"financial, permits, building, permit, expenditures, guide, issued, construction, information, filter"
6,40,582.037,"finance, year, state, monthly, fiscal, payments, government, june, july, report"
7,31,572.482,"politics, government, election, campaign, elections, commission, results, city, 460, finance"
8,48,560.589,"business, license, job, contracts, businesses, vendor, owned, contract, licenses, certified"
9,41,499.286,"service, requests, inspection, inspections, 311, public, violations, request, complaints, days"
10,34,496.677,"education, school, schools, students, public, governance, state, year, student, district"


In [54]:
#display_view.to_csv('final_topic_ranks.csv')

### Dampened Popularity Results:
Interesting! Our top transportation tag has fallen out of the Top 10. Perhaps California and Texas really were giving it too much of a boost, even with log_views. 
- We can see that two public safety tags -- I'd call them "reported crimes" "public safety calls" have taken the top spot. 
- Our "Personal Transportation" topic is #2. The real life topic of transportation information is still a popular topic.
- Interestingly, "annual budget and fiscal year data" comes in at #6 for the first time.
- 311 calls has fallen all the way to #9! **Unsurprisingly, 311 was a top topic in New York City.**
- Our parochial datasets have fallen out of the top ranks. This is good.
- **The separation between popularity metrics is much smaller and less significant. We should take care in making sweeping conclusions.** Again, we set a 0.2 threshold for this scoring category in winner-take_all. Different scoring could change this.

### Finally, let's determine most popular topics by domain using *raw totals of dataset views and downloads*. 
- Previously, we looked at the top topics for each domain using log-adjusted scoring.
- But there's really no need to control for big cities by using a logarithm on our totals when we are comparing WITHIN one city.

In [55]:
#first let's modify our old proportional_popularity function to just count TOTAL views and downloads; 
#no need to take the log to control for big cities if we are ONLY calculating what's most popular in one city

def calc_raw_pop(df):
    df = df.fillna(0)
    results_dict = {}
    
    for row_num in df.index:
        for tup in df.topic_comp[row_num]: #no need to pick a winner here...
            if not tup[0] in results_dict:
                if tup[1] >= 0.2: #.loc interprets as the label of the index, not int position
                    results_dict[tup[0]] = (tup[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total))
                else:
                    pass
            if tup[0] in results_dict:
                if tup[1] >= 0.2:
                    results_dict[tup[0]] += (tup[1] * (df.loc[row_num].download_count + 
                                             df.loc[row_num].page_views_total))
    return results_dict

In [56]:
#next we need to modify our specific domain popularity function to call calc_raw_pop, not calc_proportional_pop
def raw_popularity_by_domain(df, domain_name):
    
    domain_df = df[df.domain == domain_name]
    popularity_dict = calc_raw_pop(domain_df) #we're still using our proportional scoring here
    
    df_pop = pd.DataFrame.from_dict(popularity_dict, orient='index') #dict is in df but index numbers (aka topic numbers) are random
    df_pop = df_pop.rename(index=int, columns={0:"Adjusted_Popularity"})
    
    topic_words_list = []
    for topic_num in df_pop.index: #this iterates in the right order - index numbers have real meaning
        topic_tups = lda_52_sixty.show_topic(topic_num)
        raw_words = list(i[0] for i in lda_52_sixty.show_topic(topic_num))
        string = ", ".join(raw_words)
        topic_words_list.append(string) #plug in names by index numbers
    
    df_pop['Topic_Words'] = topic_words_list
    
    df_pop = df_pop.sort_values(by='Adjusted_Popularity', ascending=False)
    
    #put the domain in each row - kind of an ugly way to preserve info about domain but it's what we've got
    domain_string = domain_name
    df_pop['Domain'] = [domain_string for number in range(len(df_pop.index))]
    
    return df_pop

**Finally, we'll modify our top_topics_by_domain() function to use this raw metric of scoring**
- I really should have restructured how I wrote these and used @decorators. Oh well.

In [57]:
def raw_top_topics_by_domain(df, domains_series):
    
    domains_dict = {}
    for domain in domains_series:
        
        domain_df = raw_popularity_by_domain(df, domain) #this is a df like the Madison example
        Topic = domain_df.iloc[0].Topic_Words #string
        Domain = domain_df.iloc[0].Domain.split() #have to remember to split(), which really just gives us one word
        
        if Topic in domains_dict: 
            domains_dict[Topic] += Domain
        else:
            domains_dict[Topic] = list(Domain)
    
    return domains_dict

In [58]:
raw_most_popular = raw_top_topics_by_domain(topics, topics.domain.unique()) 
#again, should have just put this second arg in the functio

In [59]:
raw_most_popular

{'budget, finance, fund, year, operating, city, fiscal, funds, calendar, communities': ['data.macoupincountyil.gov',
  'data.tompsc.com',
  'data.topeka.org',
  'information.stpaul.gov',
  'performance.danvilleva.gov',
  'performance.franklintn.gov'],
 'business, license, job, contracts, businesses, vendor, owned, contract, licenses, certified': ['data.cityofgainesville.org',
  'data.culvercity.org',
  'data.oxnard.org',
  'opendata.cityofhenderson.com'],
 'care, finance, provider, child, cost, plan, day, bay, government, fee': ['data.datamontana.us'],
 'census, indicators, survey, includes, indicator, community, demographics, bureau, u, level': ['data.livewellsd.org',
  'opendata.cityofmesquite.com'],
 'development, housing, economic, community, infrastructure, managed, medicaid, department, buildings, economy': ['dashboard.alexandriava.gov',
  'data.countyofriverside.us',
  'data.hampton.gov',
  'data.maine.gov',
  'stat.cityofgainesville.org'],
 'economy, employment, labor, unemploy

In [60]:
#turn this into a DataFrame for export
pop_by_portal = pd.DataFrame([raw_most_popular])
pop_by_portal = pop_by_portal.T
pop_by_portal = pop_by_portal.reset_index()
pop_by_portal.columns = ['Topic Content', 'Domains Where Topic Most Popular']

In [61]:
pd.set_option('display.max_colwidth', 250)
pop_by_portal

Unnamed: 0,Topic Content,Domains Where Topic Most Popular
0,"budget, finance, fund, year, operating, city, fiscal, funds, calendar, communities","[data.macoupincountyil.gov, data.tompsc.com, data.topeka.org, information.stpaul.gov, performance.danvilleva.gov, performance.franklintn.gov]"
1,"business, license, job, contracts, businesses, vendor, owned, contract, licenses, certified","[data.cityofgainesville.org, data.culvercity.org, data.oxnard.org, opendata.cityofhenderson.com]"
2,"care, finance, provider, child, cost, plan, day, bay, government, fee",[data.datamontana.us]
3,"census, indicators, survey, includes, indicator, community, demographics, bureau, u, level","[data.livewellsd.org, opendata.cityofmesquite.com]"
4,"development, housing, economic, community, infrastructure, managed, medicaid, department, buildings, economy","[dashboard.alexandriava.gov, data.countyofriverside.us, data.hampton.gov, data.maine.gov, stat.cityofgainesville.org]"
5,"economy, employment, labor, unemployment, insurance, credit, workforce, economic, training, federal",[data.iowa.gov]
6,"education, school, schools, students, public, governance, state, year, student, district","[dashboard.hawaii.gov, data.mass.gov, data.vermont.gov]"
7,"energy, environment, electricity, air, sustainable, action, climate, city, clean, facilities","[data.results.wa.gov, data.sustainablesm.org, performance.providenceri.gov]"
8,"finance, year, state, monthly, fiscal, payments, government, june, july, report","[data.maryland.gov, data.nj.gov, impact.stlouisco.com, opencheckbook.data.somervillema.gov, opendata.lasvegasnevada.gov, performance.cookcountyil.gov, transparency.michigan.gov]"
9,"financial, permits, building, permit, expenditures, guide, issued, construction, information, filter","[data.burlingtonvt.gov, data.cityofboston.gov, data.cityofevanston.org, data.cstx.gov, data.fortworthtexas.gov, data.lacity.org, data.montgomeryal.gov, data.nashville.gov, data.tuscaloosa.com]"


**This can provide some interesting insights, but beware of using it for portals that have only a handful of open datasets. Using the raw total views and downloads can be misleading!**

In [85]:
#pop_by_portal.to_csv('most_popular_topics_by_portal_by_raw_totals.csv') #export

## A function to return rows with a given topic number present:
- We can use this to view datasets where a given topic wins.

In [114]:
def return_data_by_top_topic(df, topic_num):
    index_list = []
    for index, tup_list in enumerate(df.topic_comp):
        if (any(topic_num in tup for tup in tup_list)): #if a topic num is in topic_comp
            win_tuple = max(tup_list, key=lambda item:item[1]) #call winning tuple
            if win_tuple[0] == topic_num: #if winning tuple is the topic we want
                index_list.append(index) #append index number to list
    
    df = df.iloc[index_list] #then just select with this list
    return df  

In [116]:
pd.reset_option('display.max_colwidth') #reset our display to default

In [124]:
twenty_six = return_data_by_top_topic(topics, 26)
twenty_six[0:200]

Unnamed: 0.1,Unnamed: 0,name,description,attribution,columns_field_name,columns_name,type,categories,domain_category,domain_tags,provenance,download_count,page_views_last_month,page_views_last_week,page_views_total,page_views_total_log,domain,mash,big_mash,topic_comp
21,21,DCHS_Safety_DVSA Hotline Calls,[],,"['sexual_assault_hotline_calls', 'other_servic...","['Sexual Assault Hotline Calls', 'Other Servic...",chart,[],,"['sexual assault', 'domestic violence', 'safet...",official,6.000,27.000,0.000,190.000,7.577,dashboard.alexandriava.gov,"['sexual', 'assault', 'domestic', 'violence', ...","['sexual', 'assault', 'domestic', 'violence', ...","[(16, 0.1616786051), (26, 0.431318681319), (50..."
22,22,Exit Plan for Leaving Shelter,[],,"['sexual_assault_hotline_calls', 'percent_of_c...","['Sexual Assault Hotline Calls', 'Percent of c...",chart,[],,"['sexual assault', 'domestic violence', 'safet...",official,4.000,18.000,0.000,189.000,7.570,dashboard.alexandriava.gov,"['sexual', 'assault', 'domestic', 'violence', ...","['sexual', 'assault', 'domestic', 'violence', ...","[(16, 0.161766956351), (26, 0.431318681319), (..."
88,88,Total People Reached Through Community Engagem...,[],,['percent_of_participants_reported_that_they_l...,['Percent of participants reported that they l...,chart,[],,"['safety', 'sa', 'dv', 'dchs']",official,3.000,9.000,1.000,67.000,6.087,dashboard.alexandriava.gov,"['safety', 'sa', 'dv', 'dchs']","['safety', 'sa', 'dv', 'dchs']","[(16, 0.203846153846), (26, 0.403846153846), (..."
101,101,Length of Stay at Shelter,[],,"['unknown', 'left_area', 'residence_of_friend_...","['Unknown', 'Left Area', 'Residence of friend/...",chart,[],,"['sexual assault', 'domestic violence', 'safet...",official,11.000,8.000,0.000,58.000,5.883,dashboard.alexandriava.gov,"['sexual', 'assault', 'domestic', 'violence', ...","['sexual', 'assault', 'domestic', 'violence', ...","[(16, 0.162007823798), (26, 0.431318681319), (..."
160,160,Adults and Children Sheltered due to Domestic ...,[],,"['residence_of_friend_relative', 'transitional...","['Residence of friend/relative', 'Transitional...",chart,[],,"['sexual assault', 'domestic violence', 'safet...",official,0.000,0.000,0.000,29.000,4.907,dashboard.alexandriava.gov,"['sexual', 'assault', 'domestic', 'violence', ...","['sexual', 'assault', 'domestic', 'violence', ...","[(16, 0.161854234981), (26, 0.431318681319), (..."
161,161,DCHS_SA_Domestic Violence Shelter Data,[],,"['over_90_days', 'length_of_stay_less_than_24_...","['Over 90 days', 'Less than 24 hours', 'Reside...",dataset,[],,"['sexual assault', 'domestic violence', 'safet...",official,1.000,0.000,0.000,29.000,4.907,dashboard.alexandriava.gov,"['sexual', 'assault', 'domestic', 'violence', ...","['sexual', 'assault', 'domestic', 'violence', ...","[(16, 0.161685334512), (26, 0.431318681319), (..."
231,231,DCHS_Safety_DVSA Hotline Calls by Type,[],,['children_sheltered_due_to_domestic_violence'...,['Children sheltered due to Domestic Violence'...,chart,[],,"['sexual assault', 'domestic violence', 'safet...",official,0.000,0.000,0.000,10.000,3.459,dashboard.alexandriava.gov,"['sexual', 'assault', 'domestic', 'violence', ...","['sexual', 'assault', 'domestic', 'violence', ...","[(16, 0.161934869861), (26, 0.431318681319), (..."
239,239,DCHS_SA_Domestic Violence And Sexual Assault C...,[],,"['total_people_reached', 'total_hours_of_activ...","['Total People Reached', 'Total Hours of Activ...",dataset,[],,"['safety', 'sa', 'dv', 'dchs']",official,2.000,0.000,0.000,9.000,3.322,dashboard.alexandriava.gov,"['safety', 'sa', 'dv', 'dchs']","['safety', 'sa', 'dv', 'dchs']","[(16, 0.203846153846), (26, 0.403846153846), (..."
252,252,Percent of Participants who Report that they L...,[],,"['race_of_participants_caucasian', 'race_of_pa...","['Race of Participants - Caucasian', 'Race of ...",chart,[],,"['safety', 'sa', 'dv', 'dchs']",official,0.000,0.000,0.000,7.000,3.000,dashboard.alexandriava.gov,"['safety', 'sa', 'dv', 'dchs']","['safety', 'sa', 'dv', 'dchs']","[(16, 0.203846153846), (26, 0.403846153846), (..."
272,272,OHA Baseline Target - Viable Land Base,"['OHA', 'Strategic', 'Results', '(2010', '2018...","Field Note: Baseline Hawai'i State, Office of ...","['annual_target_population', 'fiscal_year', 'y...","['Annual Target Population ', 'Fiscal Year', '...",chart,[],Individual Rights,"['baseline', 'land', 'oha']",official,90.000,118.000,23.000,9494.000,13.213,dashboard.hawaii.gov,"['baseline', 'land', 'oha', 'individual', 'rig...","['baseline', 'land', 'oha', 'individual', 'rig...","[(4, 0.0445430353893), (9, 0.13570023278), (21..."


# Conclusions:
**How you count "popularity" matters.**
- Nevertheless, **transportation and crime/public safety dataset popularity is resilient** across different measures of popularity that we've defined.

**When you "dampen" the effects of cities/states whose open data portals get a high volume of traffic, popularity rankings change somewhat**. 
- This suggests that "popularity" (as we've defined it) in very large and/or populated areas is different than other areas. 
- This may be because of a genuine difference in citizen preference, a difference in what data these governments choose to display prominently (everyone care about traffic, but California *really* cares about traffic), or a skew from users across the country and world accessing this data (NYC, for instance, is a popular domain for datasets for data science research projects).

**We argue that "dampened" popularity is the most appropriate metric for wide applicability to a variety of government organizations**. 
 - It appears to truly reduce the influence of large CA, NYC, and TX datasets. Not that there's anything wrong with these places or portals! But we want this information to be useful to smaller cities as well.
 - **Using dampened popularity, the following topics show notable popularity:**
    - Crime Reports and Public Safety Response
    - Personal Transportation and Traffic
    - Community Economic Development & Housing
    - Business Licensing
    - Project Permitting and Financing

**Popularity is *always* going to be affected by what a city or state chooses to display prominently on its web portal.**
- For example -- using our old metric of popularity, ‘energy environment electricity air sustainable action climate city clean facilities’ was the most popular topic in Providence, RI. Providence, RI displays a link to sustainability info right on its front page. It is hard to tell what is cause and what is effect. 
- **This is not a controlled experiment. We hope that a sample of 141 open data portals controls enough for individual variation in what datasets are easiest for users to access in a given portal**.

**No meta analysis alone can tell a city what kind of open data will be most popular, let alone most urgent to the public interest.**
- As we saw from our undampened popularity measures, cities have a variety of most popular topics. This is influenced by both local preference and local decisions about what datasets to display prominently.
- Just because a dataset is popular doesn't mean it serves the most urgent or broadest public purpose.
    - Open data users are not a cross section of voters or the general public. Permitting information is important and useful for developers. It should be open and accessible for concerned citizens as well. But other types of information may be more useful to the average citizen. 

**These results are only applicable to domains hosted by Socrata. We'd expect that a wider range of cities/states wouldn't alter results that much, but we can't be sure.**

## Caveats
### Errors, uncertainty, and additional notes:
**Again, not every dataset is tagged thoroughy, accurately, or appropriately.** 
  - Some cities/portals just give their datasets weird names or use stock descriptions for every single category of open data. That's why we've formed a bland or vague topic or two.
    
**Proper names obviously skew results somewhat; a proper name doesn't really tell us about the content of a dataset.**
  - However, it's just not feasible to remove every proper name as a stopword; at least not without extensive trial and error
  - "Dampened Popularity" best avoids this effect by requiring a cut off (0.2), only counting "winning" topics, and normalizing popularity scores *within* a city.
 
**We are only displaying the top 10 words by affinity for each topic. The topics go much deeper, picking up some marginal affinity from words that may not make as much sense to the human eye.**
   - This is also a weakness of our model. Real life, human eye topics are of varying length in terms of words. But our model forms topics of equal sizes.
   
**Why not KMeans or some other form of clustering to group datasets?**
- Those with strong backgrounds in data science may ask why we didn't turn each dataset's "mash" into a tf-idf vector and then "cluster" datasets based on their distance in mathematical space. We tried! However, the data formed by the tf-idf vectorizer was too sparse. KMeans' explicit number of clusters forced dissimilar datasets together too often. DBSCAN's ability to create clusters of varying density and assign data points to an 'outlier' cluster produced highly specific clusters that didn't cut across cities as much as we'd like. And too often a DBSCAN assigned over half of our datasets to an outlier category.
- **We'd encourage those with strong backgrounds in data science and machine learning to take a cut at this. It's possible clustering could still produce useful formations, and it would be a more stable formation our rather reverse process using LDA topic modeling.**

**The human brain interprets our clusters at the end. It's up to us to determine what our "topic" should really be named in human English. Two LDA topics may be very similar real-life topics (we can see this with our two transportation datasets - probably because that's such a prevalent real-life category).** 
  - Then again, all of this is labeled by humans. Back to our first point, there is always room for disagreement/debate in what "subject" a dataset is about, and how narrow to make subjects.