## Data Smells

Any time you are given a dataset from anyone, you should immediately be suspicious. Is this data what I think it is? Does it include what I expect? Is there anything I need to know about it? Will it produce the information I expect?

One of the first things you should do is give it the smell test.

Failure to give data the smell test [can lead you to miss stories and get your butt kicked on a competitive story](https://source.opennews.org/en-US/learning/handling-data-about-race-and-ethnicity/).

Let's look at arrest data for Fairfax County, Va. You can find the `arrest.csv` file in this repository.

In [1]:
import agate
arrests = agate.Table.from_csv('arrest.csv')

With data smells, we're trying to find common mistakes in data. For more on data smells, read the GitHub wiki post that started it all. The common mistakes we're looking for are:

* Missing data
* Gaps in data
* Wrong type of data
* Outliers
* Sharp curves
* Conflicting information within a dataset
* Conflicting information across datasets
* Wrongly derived data
* External inconsistency
* Wrong spatial data
* Unusable data, including non-standard abbreviations, ambiguous data, extraneous data, inconsistent data

Not all of these data smells are detectable in code. You may have to ask people about the data. You may have to compare it to another dataset yourself. Does the agency that uses the data produce reports from the data? Does your analysis match those reports? That will expose wrongly derived data, or wrong units, or mistakes you made with inclusion or exclusion.
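As a rough sketch of that last check, you can line up your own totals against the numbers in an agency report and list every category that disagrees. The category names and counts below are invented for illustration, and `compare_totals` is a hypothetical helper, not part of Agate.

```python
def compare_totals(ours, published):
    """Return {category: (our_count, published_count)} for each category that differs.

    A count of None means the category is absent from that source.
    """
    diffs = {}
    for category in sorted(set(ours) | set(published)):
        mine = ours.get(category)
        theirs = published.get(category)
        if mine != theirs:
            diffs[category] = (mine, theirs)
    return diffs

# Invented numbers: our analysis vs. a hypothetical agency report.
ours = {'DUI': 120, 'Larceny': 98, 'Assault': 96}
published = {'DUI': 120, 'Larceny': 101, 'Trespassing': 12}

for category, (mine, theirs) in compare_totals(ours, published).items():
    print(category, mine, theirs)
```

Every line it prints is a question for the agency: did you exclude something, did they, or is one of you counting differently?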

But we can check for several of these data smells first, before we do anything else. Let's start with **Wrong Type Of Data**. We can sniff that out by simply printing the table structure that Agate has discovered for us.

In [4]:
print(arrests)

|-----------------+---------------|
|  column_names   | column_types  |
|-----------------+---------------|
|  LName          | Text          |
|  FName          | Text          |
|  MName          | Text          |
|  Age            | Number        |
|  DateArr        | Date          |
|  Charge         | Text          |
|  Charge Descrip | Text          |
|  Address        | Text          |
|-----------------+---------------|



Things seem to look good for this file. The name columns are text, the age column is a number, the date is a date, etc.
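When a column you expect to be a number comes back as text, it helps to see exactly which values caused it. Here is a minimal sketch using only Python's standard library rather than Agate. The sample rows are invented stand-ins for `arrest.csv`, and `non_numeric_values` is a hypothetical helper.

```python
import csv
import io

# Invented sample rows standing in for arrest.csv.
SAMPLE = """LName,Age,DateArr
Doe,34,2016-02-01
Roe,unknown,2016-02-03
Poe,,2016-02-05
"""

def non_numeric_values(rows, column):
    """Return the non-blank values in `column` that cannot be read as numbers."""
    bad = []
    for row in rows:
        value = (row[column] or '').strip()
        if not value:
            continue  # blanks are a missing-data problem, not a type problem
        try:
            float(value)
        except ValueError:
            bad.append(value)
    return bad

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(non_numeric_values(rows, 'Age'))
```

With the sample above, the only offender is the `unknown` entry. A real file may turn up several spellings of the same problem.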

The second smell we can find in Agate is **Missing Data**. We can do that through a series of Group By and Count steps. Let's start with Charge Descrip.

In [8]:
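# Group the arrests by charge description, then count the rows in each group
charges = arrests.group_by('Charge Descrip')
charge_counts = charges.aggregate([('charge_total', agate.Count('Charge Descrip'))])
# Sort so the most common charges come first
charge_counts = charge_counts.order_by('charge_total', reverse=True)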

In [9]:
charge_counts.print_table(max_column_width=50)

|-----------------------------------------------------+---------------|
|  Charge Descrip                                     | charge_total  |
|-----------------------------------------------------+---------------|
|  LICENSE: DRIVE W/O                                 |          397  |
|  LIC REVOKED: DR W/O LICENSE, 1 OFF                 |          385  |
|  DRUNK IN PUBLIC OR PROFANE                         |          177  |
|  DRUGS: POSSESS MARIJUANA, 1ST OFF                  |          161  |
|  PETIT LARCENY: <$200 NOT FROM A PERSON             |           98  |
|  ASSAULT: ON FAMILY MEMBER                          |           96  |
|  RECKLE/20 MPH OVER LIMIT                           |           92  |
|  GRAND LARCENY: $200+ NOT FROM A PERSON             |           88  |
|  DRUGS: POSSESS SCH I OR II                         |           61  |
|  RECK DR: GENERALLY, ENDANGER LIFE/LIMB/PROPERTY    |           59  |
|  LIC REVOKED: DR W/O LICENSE, 2 OFF                 |         

There's a lot of data here, but be sure to focus on the last row - it's blank, and the count is 0, which means there are no rows that are missing a Charge Descrip. That's good, because it means we have no missing data in that column. You can try the same process out using another column like `Age` or `FName`.
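If you want a quick tally of blanks across every column at once, a sketch like the following works. It uses the standard library's `csv` module instead of Agate, the sample rows are invented, and `blank_counts` is a hypothetical helper name.

```python
import csv
import io

# Invented sample rows; the column names follow arrest.csv.
SAMPLE = """LName,FName,Age,Charge Descrip
Doe,Jane,34,LICENSE: DRIVE W/O
Roe,,41,DRUNK IN PUBLIC OR PROFANE
Poe,Sam,,
"""

def blank_counts(rows, columns):
    """Count rows where each column is empty or only whitespace."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            if not (row[column] or '').strip():
                counts[column] += 1
    return counts

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
missing = blank_counts(rows, reader.fieldnames)
for column, count in missing.items():
    print(column, count)
```

Any column with a non-zero count is worth a closer look before you group or aggregate on it.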

Let's now look at **Gaps in Data**. It's been my experience that gaps in data often have to do with time, so let's first look at arrests by month to see what period our arrest data covers. You'd expect the number to change from month to month, but not by huge amounts. A huge difference usually means data is missing. To do this, we'll need to calculate the month from the date.

In [11]:
arrests_with_months = arrests.compute([
    ('arrest_month', agate.Formula(agate.Text(), lambda row: '%s' % row['DateArr'].month))
])

In [12]:
months = arrests_with_months.group_by('arrest_month')

In [13]:
month_counts = months.aggregate([
    ('count', agate.Count('Charge Descrip'))
])
month_counts = month_counts.order_by('count', reverse=True)

In [14]:
month_counts.print_table()

|---------------+--------|
|  arrest_month | count  |
|---------------+--------|
|  2            | 2,949  |
|  1            |   164  |
|  12           |   130  |
|  11           |    99  |
|  10           |    72  |
|  9            |    60  |
|  8            |    13  |
|  4            |     5  |
|  6            |     4  |
|  5            |     2  |
|  3            |     1  |
|---------------+--------|


This looks a little weird. Yes, the bulk of charges came in February, which is pretty recent, with smaller amounts in January and December. But 60 from September and 72 from October? What's with those? This would mean additional reporting - maybe this is standard practice, or can be easily explained. Maybe an analysis would just include arrest records from the most recent month.
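Grouping on the month number alone has two blind spots: it merges the same month from different years, and a month with no arrests at all simply doesn't appear (the table above has no row for month 7). One way to catch both is to count by year and month together, then walk the calendar between the first and last month. This is a standard-library sketch with invented dates, and `month_counts` and `missing_months` are hypothetical helpers.

```python
from collections import Counter
from datetime import date

# Invented arrest dates for illustration.
ARREST_DATES = [
    date(2015, 9, 14),
    date(2015, 11, 2),
    date(2015, 11, 20),
    date(2016, 2, 1),
    date(2016, 2, 3),
]

def month_counts(dates):
    """Count dates per (year, month) pair."""
    return Counter((d.year, d.month) for d in dates)

def missing_months(counts):
    """List every (year, month) between the earliest and latest with no rows."""
    if not counts:
        return []
    year, month = min(counts)
    last = max(counts)
    gaps = []
    while (year, month) <= last:
        if (year, month) not in counts:
            gaps.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps

counts = month_counts(ARREST_DATES)
print(missing_months(counts))
```

An empty list means every month in the range has at least one record. It says nothing about whether each month is complete, so you still need to eyeball the counts.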

Looks like we should compile some questions to ask the Fairfax County Police Department.

### Assignment

What about the Age column - are there outliers there, and what does Age tell you about the contents of the dataset? Is Address data standardized, or are there variations? Are there errors, like misspellings, in the data? Are there outliers? What are the more interesting and potentially newsworthy charges? What steps in [Agate](https://agate.readthedocs.org/en/1.3.0/) can you take to find out?

Try exploring the data using some of the steps listed above, but with other columns. You can use many of the same functions we used; just change the column name. Add them to this notebook, along with some questions you'd ask the Fairfax police, then add, commit and push this notebook file (`arrests.ipynb`) to GitHub.