# Checking web app data against source

We have a dashboard on our web app that gets its data from our reporting database. However, that reporting database was recently populated with data migrated from legacy systems.

We want to verify that the dashboard and the database are showing the correct data, so we need to check them against an extract of the legacy system's data in a CSV or XLSX file.

Let's begin!


In [None]:
import pandas as pd
import numpy as np

In [None]:
#after familiarizing myself with the file I know I only need some of the columns and not all of them
columns = ['IndividualCampus','Course','AttendedDate','MonthofClass','YearofClass']
df = pd.read_excel('###.xlsx',sheet_name='Data',usecols=columns)

In [None]:
#let's also rename the columns
names = {'IndividualCampus':'campus','Course':'class','AttendedDate':'date','MonthofClass':'month','YearofClass':'year'}
df.rename(names, axis=1, inplace=True)
df.head()

The dashboard is summarized by year and by campus. It also shows data only for the last three years.

So, let's filter the dataframe to just the last three years, then aggregate by campus and also across all campuses per year.
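
As a rough sketch of that aggregation, `pd.pivot_table` with `margins=True` can produce the per-campus counts and the all-campus totals in one step. The campus names and rows below are made up for illustration; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy attendance rows (hypothetical values, not the real extract).
toy = pd.DataFrame({
    'campus': ['North', 'North', 'South', 'South', 'South', 'North'],
    'year':   [2016, 2017, 2017, 2018, 2018, 2019],
    'date':   ['2016-03-01', '2017-05-02', '2017-06-03',
               '2018-01-04', '2018-02-05', '2019-07-06'],
})

# Keep only the three years the dashboard shows.
recent = toy[toy['year'].isin([2017, 2018, 2019])]

# Count rows per campus and year. margins=True adds an 'All' row and an
# 'All' column holding the totals across campuses and across years.
summary = pd.pivot_table(recent, values='date', index='campus', columns='year',
                         aggfunc='count', fill_value=0, margins=True)
print(summary)
```

The `All` row is the "all campuses per year" view, so it can be checked against the dashboard's yearly totals directly.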

In [None]:
data = df[df['year'].isin([2017,2018,2019])].copy()
data.head()

In [None]:
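#count the 101 classes per campus and year; na=False treats rows with a missing class name as non-matches
data_101 = data[data['class'].str.contains('101',na=False)].groupby(by=['campus','year'],as_index=False)['date'].count().rename({'date':'total'},axis=1)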

In [None]:
data_101

In [None]:
data_101.shape

In [None]:
#let's now bring in these values from the database
import sqlalchemy
import pyodbc

In [None]:
server = '####'
database = '####'
username = '####'
password = '######'
driver= '{ODBC Driver 17 for SQL Server}'
cnxn = pyodbc.connect('DRIVER='+driver+';SERVER='+server+';PORT=1433'+';DATABASE='+database+';UID='+username+';PWD='+password)
cursor = cnxn.cursor()
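
One possible way to keep the credentials out of the notebook is to read them from environment variables and assemble the same connection string from them. This is only a sketch: the `REPORTING_DB_*` variable names and the `build_connection_string` helper are hypothetical, not something this project already defines.

```python
import os

def build_connection_string(env=os.environ):
    """Assemble the ODBC connection string from environment variables."""
    # The REPORTING_DB_* names are assumptions; use whatever your setup defines.
    parts = {
        'DRIVER': '{ODBC Driver 17 for SQL Server}',
        'SERVER': env['REPORTING_DB_SERVER'],
        'PORT': '1433',
        'DATABASE': env['REPORTING_DB_NAME'],
        'UID': env['REPORTING_DB_USER'],
        'PWD': env['REPORTING_DB_PASSWORD'],
    }
    # Same KEY=value;KEY=value layout as the string built by hand above.
    return ';'.join(f'{key}={value}' for key, value in parts.items())

# Intended usage (needs a reachable server, so it is left commented out):
# cnxn = pyodbc.connect(build_connection_string())
```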

In [None]:
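#the query text is elided here; executing it once checks that it runs before we load it with pandas
string = 'SELECT #####'
with cnxn:
    with cursor as crs:
        crs.execute(string)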

In [None]:
db_101 = pd.read_sql(string,con=cnxn)

In [None]:
db_101

In [None]:
db_101.shape

In [None]:
#let's now merge the two dataframes to compare
compare_101 = data_101.merge(db_101,how='outer',left_on=['year','campus'],right_on=['year','CampusName'],sort=True)
compare_101
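
Because this is an outer merge, some campus/year pairs may exist on only one side. A small sketch of how `indicator=True` can surface those rows before any missing values are filled in; the two toy frames and their numbers are invented, and only the column names follow the ones above.

```python
import pandas as pd

# Toy stand-ins for data_101 and db_101 (hypothetical values).
xlsx_counts = pd.DataFrame({'campus': ['North', 'South'],
                            'year': [2017, 2017],
                            'total': [10, 7]})
db_counts = pd.DataFrame({'CampusName': ['North', 'East'],
                          'year': [2017, 2017],
                          'total': [10, 4]})

# indicator=True adds a '_merge' column saying where each row came from:
# 'both', 'left_only' (only in the extract) or 'right_only' (only in the database).
merged = xlsx_counts.merge(db_counts, how='outer',
                           left_on=['year', 'campus'],
                           right_on=['year', 'CampusName'],
                           suffixes=('_xlsx', '_database'),
                           indicator=True)

# Rows that could not be matched on campus and year.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched)
```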

In [None]:
compare_101.fillna(0,inplace=True)
compare_101.rename({'total_x':'total_xlsx','total_y':'total_database'},inplace=True, axis=1)
compare_101['total_xlsx']=compare_101['total_xlsx'].astype(dtype='int64')
compare_101

In [None]:
differences_101 = compare_101.copy()
differences_101['diff']=differences_101['total_xlsx']-differences_101['total_database']
differences_101

In [None]:
#let's add some color to ease the spotting of errors
def color_negative_red(val):
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color
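#note: this turns differences_101 into a Styler object; pandas 2.1+ renames Styler.applymap to Styler.map
differences_101 = differences_101.style.applymap(color_negative_red,subset=pd.IndexSlice[:,'diff'])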
differences_101

## Observations

It seems that 38 of the 46 results in the database differ from our source data.
Let's investigate this a little bit more.
I will:
1. Verify that the source data is accurate
2. Verify the database query to make sure I'm querying the right information
3. Decide on next steps
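
A count like the one above does not have to be read off the table by eye. Here is a minimal sketch of computing it, assuming a comparison frame shaped like `compare_101`; the rows are made up, so the printed numbers are not the real ones.

```python
import pandas as pd

# Toy comparison table with the same column names as compare_101 (hypothetical values).
toy_compare = pd.DataFrame({
    'campus': ['North', 'South', 'East'],
    'year': [2017, 2017, 2017],
    'total_xlsx': [10, 7, 0],
    'total_database': [10, 9, 4],
})

# A non-zero difference means the database disagrees with the extract.
toy_compare['diff'] = toy_compare['total_xlsx'] - toy_compare['total_database']
mismatched = toy_compare[toy_compare['diff'] != 0]

print(f'{len(mismatched)} out of {len(toy_compare)} rows differ')
```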

### Next Steps:

1. Verify that the API call returns the same results as our query for the 101 class numbers per campus
2. If the first check passes, verify the data in the database to search for duplicates or other possible sources of errors

In [None]:
#we got the API results in an Excel file. Let's bring them in
api_101 = pd.read_excel('api_class101.xlsx',sheet_name='class101',index_col=0)
api_101

In [None]:
#let's see our dataframe from our query before
db_101.head(10)

In [None]:
#let's pivot this dataframe to match the structure of our api df
pivoted_101 = pd.pivot_table(db_101,values=['total'],index='CampusName',columns=['year'])

In [None]:
pivoted_101

In [None]:
pivoted_101.columns

In [None]:
pivoted_101.columns = pivoted_101.columns.droplevel()

In [None]:
#pivoted_101.rename({'CampusName':'campus','2017':'2017','2018':'2018','2019':'2019'},inplace=True)

In [None]:
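#clear the leftover name on the columns; assigning None also works on pandas versions where `del` raises an AttributeError
pivoted_101.columns.name = None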

In [None]:
api_vs_query = api_101.merge(pivoted_101,how='inner',left_on='campus',right_on='CampusName',suffixes=('_api','_query'))
api_vs_query

In [None]:
api_vs_query.columns

In [None]:
#let's rename these columns
api_vs_query.rename({2017:'2017_query',2018:'2018_query',2019:'2019_query'},inplace=True,axis=1)

In [None]:
api_vs_query

### Observations

Since it's a very short list, we can quickly see that there are no differences between our query and the API call.
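
For a longer list, the same check could be done in code rather than by eye. This is a sketch under assumptions: the `_api` column names are a guess (only the `_query` names appear in this notebook), and the toy frame deliberately contains one mismatch so that the check has something to find.

```python
import pandas as pd

# Toy stand-in for api_vs_query (hypothetical values and '_api' column names).
toy = pd.DataFrame({
    'campus': ['North', 'South'],
    '2017_api': [10, 7], '2017_query': [10, 7],
    '2018_api': [12, 9], '2018_query': [12, 8],
})

years = ['2017', '2018']
api_cols = [f'{year}_api' for year in years]
query_cols = [f'{year}_query' for year in years]

# Compare the two column groups position by position.
matches = toy[api_cols].to_numpy() == toy[query_cols].to_numpy()

# Campuses where at least one year disagrees between the API and the query.
rows_with_mismatch = toy.loc[~matches.all(axis=1), 'campus'].tolist()
print(rows_with_mismatch)
```

An empty list would mean the API call and the query agree for every campus and year.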

Given that these two are querying the same table in the database, that they were created with different queries for the same purpose, and that both give the same result, we can assume that the problem is not in the query itself but in the data.

Therefore, we now proceed to look at the actual data to find duplicates or other possible sources of errors.
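
Before going to the database, here is a small pandas sketch of what a duplicate check can look like. The `person_id` column and all the rows are hypothetical, and the real de-duplication in this notebook happens in the SQL query, which is not shown.

```python
import pandas as pd

# Toy attendance rows (hypothetical): person 1 appears twice for the same
# campus and year, person 3 appears once in each of two different years.
attendance = pd.DataFrame({
    'person_id': [1, 1, 2, 3, 3],
    'CampusName': ['North', 'North', 'North', 'South', 'South'],
    'EventYear': [2017, 2017, 2017, 2018, 2019],
})

key = ['person_id', 'CampusName', 'EventYear']

# keep=False flags every row that shares its key with another row.
dupes = attendance[attendance.duplicated(subset=key, keep=False)]

# Count attendance per campus and year after dropping the repeated rows.
deduped_counts = (attendance.drop_duplicates(subset=key)
                  .groupby(['CampusName', 'EventYear'])['person_id']
                  .count()
                  .reset_index(name='noDuplicates'))
print(deduped_counts)
```

Note that de-duplicating on `person_id` alone would also drop person 3's second year, which is a legitimate attendance, so the choice of key matters.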

In [None]:
#let's create another query to the database to get all the rows without duplicate person IDs
#the query text is elided here; executing it once checks that it runs before we load it with pandas
string = 'SELECT #####'
with cnxn:
    with cursor as crs:
        crs.execute(string)

In [None]:
new_query = pd.read_sql(string,con=cnxn)
new_query.head()

In [None]:
#let's now join this with our data to check new differences if any
compare_101_noDuplicates = compare_101.merge(new_query,how='outer',left_on=['year','campus'],right_on=['EventYear','CampusName'],sort=True)
compare_101_noDuplicates.drop(['CampusName_x','CampusName_y','class','EventYear','total'],axis=1,inplace=True)
compare_101_noDuplicates['diff'] = compare_101_noDuplicates['total_xlsx'] - compare_101_noDuplicates['noDuplicates']
compare_101_noDuplicates


In [None]:
#let's add some color to ease the spotting of errors
def color_values(val):
    if val < 0:
        color = 'red'
    elif val > 0:
        color= 'green'
    else:
        color = 'black'
    return 'color: %s' % color
colored_errors101 = compare_101_noDuplicates.style.applymap(color_values,subset=pd.IndexSlice[:,'diff'])
colored_errors101

## Conclusion

As we can see, removing the duplicate person IDs does help in some cases. In other cases, it actually seems to have removed valid occurrences of a repeated person ID.
Also, even though we have removed duplicate person IDs, we still have several instances of inaccurate data.

Further investigation is required on the warehouse data and that is outside of the scope of the current research.

In [None]:
#the context manager saves and closes the file on exit; writer.save() was removed in pandas 2.0
with pd.ExcelWriter('class101_errors.xlsx',engine='xlsxwriter') as writer:
    compare_101_noDuplicates.to_excel(writer,sheet_name='class101')
