# Profiling Data
\[_In case you’re unable to see the Atoti visualizations in GitHub, try viewing the notebook in [nbviewer](https://nbviewer.org/github/atoti/notebooks/blob/main/notebooks/01-use-cases/other-industries/col-data-profile/main.ipynb)._]

A frequent and important first step before any analytics project is a data profiling and cleansing exercise.  Data profiling, or examining your data to summarize its content and quality, can give a sense of:
* how complete the data is,
* how messy or inconsistent the data is,
* how it is distributed (range, frequency, et cetera),
* and other attributes.
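
To make these ideas concrete, here is a minimal plain-Python sketch of what "completeness" and "distribution" can mean for a single column. The rows and the `profile_column` helper are made up for illustration and are not part of the dataset or of Atoti.

```python
# Hypothetical sample rows; None marks a missing value.
rows = [
    {"City": "City A", "State": "TX", "Composite": 95.5},
    {"City": "City B", "State": "Virginia", "Composite": 111.9},
    {"City": "City C", "State": None, "Composite": 132.5},
]


def profile_column(rows, name):
    """Summarize completeness and distinct values for one column."""
    values = [row[name] for row in rows]
    present = [value for value in values if value is not None]
    return {
        "completeness": len(present) / len(values),  # share of non-missing values
        "distinct": len(set(present)),  # number of different values seen
    }


state_profile = profile_column(rows, "State")

# Range of a numeric column.
composite_values = [row["Composite"] for row in rows]
composite_range = (min(composite_values), max(composite_values))
```

The rest of this notebook answers the same questions visually instead of by hand.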

We'll profile a Cost of Living dataset using Atoti to gain these insights.  For this, we'll start with the data precisely as is, with no prior cleansing.

First, we'll import Atoti and create a session.

In [1]:
import atoti as tt

In [2]:
session = tt.Session()

From here, we can read in our data, and create a cube so we can start visualizing and summarizing its key attributes.  Since this is city level data for a single year, we'll use the city and state (just in case there are duplicated city names across states) as our unique row definition.
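
Before relying on City and State as the key, it can help to confirm that the pair really is unique. A minimal sketch in plain Python, using made-up pairs rather than the dataset, could look like this:

```python
from collections import Counter

# Hypothetical (City, State) pairs; not taken from the dataset.
pairs = [
    ("Springfield", "IL"),
    ("Springfield", "MO"),
    ("Portland", "OR"),
    ("Portland", "ME"),
    ("Portland", "OR"),  # deliberate duplicate of the composite key
]

city_counts = Counter(city for city, _ in pairs)
pair_counts = Counter(pairs)

# Cities that repeat when City alone is used as the key.
repeated_cities = sorted(city for city, n in city_counts.items() if n > 1)

# Rows that still collide when (City, State) is used as the key.
duplicate_keys = sorted(pair for pair, n in pair_counts.items() if n > 1)
```

Here City alone is ambiguous for both sample cities, while the composite key leaves a single genuine duplicate to investigate.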

When creating the cube, we'll use *no_measures* mode and only create measures once we've explored the data a bit.  In *no_measures* mode, the only measure that will be created is *contributors.COUNT*.

In [3]:
COL2010 = session.read_csv(
    "s3://data.atoti.io/notebooks/col-data-profile/Cost_of_Living_Index_for_Selected_US_Cities_2010.csv",
    keys={"City", "State"},
)

In [4]:
cube = session.create_cube(COL2010, mode="no_measures")

In [5]:
cube

### Exploring data quality

We can now begin to explore our data.  We'll start by investigating how our data is distributed across each state.

We notice immediately there is some messiness afoot:
* Values like Queens, Brooklyn, and County are not states (misclassification!)
* N/A is not a state (missing data!)
* Some 'states' like MO-IL combine two states into one value, and appear alongside the standalone entries for those states
* Virginia appears twice, as VA and as Virginia (inconsistent conformance!)
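
The same categories of messiness can also be flagged programmatically. The sketch below is one possible approach in plain Python; the `classify_state` helper and its labels are hypothetical, and the sample values are the ones called out in the list above.

```python
import re


def classify_state(value):
    """Label a raw State value by the kind of issue it shows, if any."""
    if value is None or value.strip() in {"", "N/A"}:
        return "missing"
    value = value.strip()
    if re.fullmatch(r"[A-Z]{2}", value):
        return "ok"  # right shape only; not checked against a list of real states
    if re.fullmatch(r"[A-Z]{2}(-[A-Z]{2})+", value):
        return "combined"  # e.g. two states joined by a hyphen
    return "nonconforming"  # full names, boroughs, counties, and so on


raw_states = ["TX", "N/A", "MO-IL", "Virginia", "Queens", "VA"]
labels = {state: classify_state(state) for state in raw_states}
```

Note that "ok" only means the value looks like a two-letter code; validating it against an actual list of state abbreviations would be a separate step.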

In [6]:
session.visualize("Distribution of data across States")

For each of these, let's explore further what precisely is going on.  We'll start with Brooklyn, Manhattan, Queens, and County.  These all appear to be associated with a city called "New", which suggests that New York City was broken down into borough-level data and the name was split across the two columns.

Ideally, they should each be grouped under the state of NY.  Similarly, Nassau County is a county near NYC in NY state.

In [7]:
session.visualize("Exploring Brooklyn, Manhattan, Queens, & County Data")

Now, for the places without a state.  Here, it appears state data was errantly included in the city column.

In [8]:
session.visualize("Exploring N/A Data")

For the "paired" states--it appears these are border cities or sister cities which make up one metro area.

In [9]:
session.visualize("Exploring Hyphenated States Data")

And, finally, for Virginia.  It appears Hampton's data was entered without conforming to the two-letter state abbreviation, but thankfully it is not duplicated.

In [10]:
session.visualize("Exploring Virginia Data")
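
A conformance issue like this one is usually fixed with a lookup from full names to abbreviations. The following is a minimal sketch under that assumption; the `STATE_ABBREVIATIONS` table is deliberately partial and the `normalize_state` helper is hypothetical.

```python
# Partial, illustrative lookup; a real one would cover every state.
STATE_ABBREVIATIONS = {"virginia": "VA", "texas": "TX", "new york": "NY"}


def normalize_state(value):
    """Return the two-letter code for a known full name, else the trimmed input."""
    cleaned = value.strip()
    return STATE_ABBREVIATIONS.get(cleaned.lower(), cleaned)


normalized = [normalize_state(s) for s in ["Virginia", "VA", " texas ", "MO-IL"]]
```

Values the lookup does not recognize are passed through unchanged, so they remain visible for manual review rather than being silently altered.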

### Exploring data distribution

Going back to how the data is distributed, even accounting for things like Virginia vs VA vs DC-VA, Texas provides by far the largest number of data points (31 contributions), with North Carolina providing the second most (16 contributions).

In [11]:
session.visualize("Alt View-Distribution of data across States")
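
One way to account for combined values such as DC-VA when counting contributions is to credit the row to each state it names. This is only a sketch of that choice, with made-up State values:

```python
from collections import Counter

# Hypothetical State values, one per city row; not taken from the dataset.
states = ["TX", "TX", "VA", "DC-VA", "MO-IL", "MO"]

# Count each row once per state it names, so "DC-VA" counts for both DC and VA.
per_state = Counter(part for state in states for part in state.split("-"))
```

This approach counts a cross-border metro area more than once, so the per-state totals no longer sum to the number of rows. Keeping combined values as their own group is the alternative when row totals must be preserved.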

Let's see what column data we have for these locations.  For these columns, we can add new measures to investigate the min, max, and mean.

In [12]:
COL2010.columns

['City',
 'State',
 '100% Composite Index',
 '13 % Grocery Items',
 '29 % Housing',
 '10% Utilities',
 '12 % Transportation',
 '4% Health Care',
 '32 % Miscellaneous Goods and Services']

In [13]:
# Every column other than the two key columns holds a score.
for column_name in [
    column_name
    for column_name in COL2010.columns
    if column_name not in {"City", "State"}
]:
    column = COL2010[column_name]
    # Aggregations for comparing groups of cities, for example per state.
    cube.measures[f"{column_name}.MIN"] = tt.agg.min(column)
    cube.measures[f"{column_name}.MAX"] = tt.agg.max(column)
    cube.measures[f"{column_name}.MEAN"] = tt.agg.mean(column)
    # The raw score, returned only when the query is broken down by City.
    cube.measures[column_name] = tt.where(
        ~cube.levels["City"].isnull(), tt.agg.single_value(column)
    )
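
For readers who want to see what the `.MIN`, `.MAX`, and `.MEAN` measures amount to when viewed per state, here is a conceptual plain-Python equivalent. It is a sketch with made-up scores and does not reflect how Atoti computes aggregations internally.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (state, utilities score) pairs; not taken from the dataset.
scores = [("TX", 90.0), ("TX", 110.0), ("NE", 95.0), ("NE", 97.0)]

# Group the scores by state.
by_state = defaultdict(list)
for state, score in scores:
    by_state[state].append(score)

# Compute the three summary statistics for each group.
summary = {
    state: {"min": min(values), "max": max(values), "mean": mean(values)}
    for state, values in by_state.items()
}
```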

In [14]:
cube.measures

In [15]:
session.visualize("COL score per city")

The cost of living score is a comparative index: the percentage in each column header indicates the target share of a budget for that category, and the score per city indicates how far above or below the national average that location is.  Thus, for a place like Brooklyn, the data shows that healthcare is 11.5% more expensive than average, and that the overall cost of living is 81.7% higher than average.
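
The arithmetic behind this reading can be sketched as follows. The index scores 111.5 and 181.7 are the ones implied by the Brooklyn percentages above. The `weighted_composite` function is an assumption: it treats the composite index as the weight-averaged category scores, which matches the weights in the column headers summing to 100 but is not confirmed by the dataset itself.

```python
def percent_above_average(index_score, baseline=100.0):
    """Convert an index score into percent above (+) or below (-) the average."""
    return (index_score - baseline) / baseline * 100


# Category weights as read from the column headers (they sum to 100).
WEIGHTS = {
    "Grocery Items": 13,
    "Housing": 29,
    "Utilities": 10,
    "Transportation": 12,
    "Health Care": 4,
    "Miscellaneous Goods and Services": 32,
}


def weighted_composite(category_scores):
    """Assumed composite: category scores averaged using the header weights."""
    total = sum(WEIGHTS[name] * score for name, score in category_scores.items())
    return total / sum(WEIGHTS.values())


healthcare_delta = round(percent_above_average(111.5), 2)
overall_delta = round(percent_above_average(181.7), 2)
```

Under this assumption, a city scoring exactly 100 in every category would also have a composite of 100.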

Let's look at how the utilities scores (the 10% Utilities column) compare across states.  We'll explore the max, min, and mean for each state.  Scrolling through the data, nothing seems to jump out for this category.

In [16]:
session.visualize()

Let's look at how our 100% Composite Index is distributed across each city within each state.  We'll first create a benchmark measure of 100 to compare against.

In [17]:
cube.measures["100"] = 100

In [18]:
session.visualize("Distribution of COL scores")

There seem to be some outliers in our data.  We can always pan and zoom to temporarily ignore the outliers around the 1000 mark, and make a note to investigate the data for the locations that are so far out of bounds.
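
Outliers like these can also be flagged without a chart. The sketch below uses Tukey's fences (1.5 times the interquartile range), which is one common rule of thumb and not something the notebook itself prescribes; the scores are made up, with one value placed near the 1000 mark.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [value for value in values if value < low or value > high]


# Hypothetical composite scores, with one entry near the 1000 mark.
composite = [88.0, 92.5, 95.0, 99.0, 101.5, 104.0, 110.0, 1000.0]
flagged = iqr_outliers(composite)
```

The same helper flags unusually low values as well, which makes it useful for spotting scores that are implausibly small.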

Zooming in, we can see that for certain states, like Massachusetts, every city's 100% Composite Index score is greater than 100, whereas for a state like Nebraska, both cities score below 100.

Let's look at the min, max, and mean for each state.

In [19]:
session.visualize("Utility COL score per city")

And since we have so many contributions from Texas, let's explore the distribution of data in Texas a bit more closely.  There are no cities in Texas for which all categories have a score greater than 100, while there are cities for which all categories are below 100 (Amarillo, Brownsville, Waco, and Wichita, to name a few).

The data for Paris, Texas looks a bit odd.  It seems the COL score for housing is 8.  This looks like data worth verifying.

In [20]:
session.visualize("COL Scores for Texas Cities")
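
The "all categories above or below 100" check described above can be expressed compactly with `all()`. This is a sketch with made-up cities and scores, not values from the dataset:

```python
# Hypothetical category scores per city; not taken from the dataset.
city_scores = {
    "City A": {"Housing": 85.0, "Utilities": 92.0, "Health Care": 97.5},
    "City B": {"Housing": 105.0, "Utilities": 98.0, "Health Care": 101.0},
    "City C": {"Housing": 8.0, "Utilities": 94.0, "Health Care": 96.0},
}

# Cities where every category is cheaper than the benchmark of 100.
all_below = [
    city
    for city, scores in city_scores.items()
    if all(score < 100 for score in scores.values())
]

# Cities where every category is more expensive than the benchmark of 100.
all_above = [
    city
    for city, scores in city_scores.items()
    if all(score > 100 for score in scores.values())
]
```

A suspicious value, such as the housing score of 8 for the hypothetical City C, still passes the "below 100" test, so a check like this is best run after implausible scores have been verified.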

We can continue this exercise for each category, investigating the data and earmarking phenomena worth exploring further.

### Conclusion

With Atoti, we are able to leverage the power of visualizations and a few simple stats to profile our data and create a plan for how we want to cleanse it and what possible analytics we may want to do with our data.
