# Extracting

Let us begin by extracting our data into Pandas dataframes. I have created a simple Python module just for that task.

All dataframes will have duplicate rows deleted and empty strings will be substituted with NaN values. This is a very basic transformation that can be done for almost any data. Further transformations will depend on what we see in our dataframes.

In [None]:
import extract, config
import pandas as pd

# Specifying configuration path and initiating an Extractor
datasources_config_path = './config/data_source_config.json'
extractor = extract.Extractor(datasources_config_path)

# Extracting dataframes
product_invoices_df = extractor.load_json_files_to_df('product_invoices')
product_package_types_df = extractor.load_json_files_to_df('product_package_types')
product_shipments_df = extractor.load_json_files_to_df('product_shipments')
provider_invoices_df = extractor.load_json_files_to_df('provider_invoices')
provider_prices_df = extractor.load_json_files_to_df('provider_prices')
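
The basic cleaning described above (dropping duplicate rows and turning empty strings into NaN) could look roughly like the sketch below. `basic_clean` is a hypothetical helper written for illustration; it is not the actual code inside `extract.Extractor`, and the sample data is made up.

```python
import numpy as np
import pandas as pd


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: empty strings become NaN, exact duplicate rows are dropped."""
    # Treat empty or whitespace-only strings as missing values.
    cleaned = df.replace(r"^\s*$", np.nan, regex=True)
    # Remove rows that are exact copies of an earlier row.
    cleaned = cleaned.drop_duplicates()
    return cleaned.reset_index(drop=True)


# Made-up sample data, only to show the effect.
raw = pd.DataFrame(
    {"id": [1, 1, 2, 3], "description": ["small", "small", "", "large"]}
)
clean = basic_clean(raw)
print(clean)
```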


# Checking and Transforming

Now let us check our dataframes one by one and fix anything funky that might be going on with them.

I will focus on missing data and potential primary keys that are not unique.

In [None]:
import transform
transformer = transform.Transformer()

In [None]:
# The list of transactions where we have collected the shipment fee from the buyers
product_invoices_df

In [None]:
# Basic checks
transformer.check_if_unique_id(product_invoices_df, 'transaction_id')
transformer.check_missing_values(product_invoices_df)

The red background returned here looks scary but that is just my logs doing what they are supposed to - we can see that the transaction_id is unique and there are no missing values. Moving on to the next dataframe!
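
For readers without access to the `transform` module, here is a minimal sketch of what the two checks might do. The function bodies, return values, and logger name are assumptions made for illustration; the real `Transformer` methods may behave differently.

```python
import logging

import pandas as pd

logger = logging.getLogger("transform_sketch")  # hypothetical logger name


def check_if_unique_id(df: pd.DataFrame, column: str) -> bool:
    """Log and return whether `column` could serve as a primary key."""
    is_unique = bool(df[column].is_unique)
    logger.info("Column %s unique: %s", column, is_unique)
    return is_unique


def check_missing_values(df: pd.DataFrame) -> pd.Series:
    """Log and return the missing-value count for every column that has any."""
    missing = df.isna().sum()
    missing = missing[missing > 0]
    logger.info("Missing values per column: %s", missing.to_dict())
    return missing


# Made-up sample with a repeated ID and one missing amount.
sample = pd.DataFrame({"transaction_id": [1, 2, 2], "amount": [3.5, None, 4.0]})
id_is_unique = check_if_unique_id(sample, "transaction_id")
missing_counts = check_missing_values(sample)
```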

In [None]:
# The list of package types we have on our platform that a seller can select.
product_package_types_df

The description data in the product_package_types_df does not look uniform. Some values include both dimensions and weight of the packages. Dimensions are not referred to in any of the other tables. We will transform this data so that a weight value is saved in a separate column. This way we can make the most of it - for example, we can later compare whether the shipping information matches the information provided by the sellers.

In [None]:
product_package_types_df['weight_reported'] = product_package_types_df['description'].apply(transformer.cleanup_description)
product_package_types_df

Now that looks much more usable! I will do the rest of the checks now.

In [None]:
# Basic checks
transformer.check_if_unique_id(product_package_types_df, 'id')
transformer.check_missing_values(product_package_types_df)

We can see there were missing description values, but the new weight_reported column has a value for them - you can't see it here, but the transformation set them to the number 0. This still doesn't seem completely right, but we won't drop these rows as their IDs might still be referred to in other tables.
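
A weight extraction of this kind could be sketched as follows. The description formats, the regular expression, and the function name `extract_weight_kg` are all assumptions for illustration; the actual `cleanup_description` is not shown in this notebook. The only behaviour taken from the text is that missing descriptions end up as 0.

```python
import re

import pandas as pd

# Assumed format: a number (with "." or "," as decimal separator) followed by "kg".
WEIGHT_PATTERN = re.compile(r"(\d+(?:[.,]\d+)?)\s*kg", re.IGNORECASE)


def extract_weight_kg(description) -> float:
    """Return the weight in kg found in a description, or 0.0 when there is none."""
    if not isinstance(description, str):
        return 0.0  # missing description
    match = WEIGHT_PATTERN.search(description)
    if match is None:
        return 0.0  # no weight mentioned
    return float(match.group(1).replace(",", "."))


# Made-up descriptions covering the assumed formats.
descriptions = pd.Series(["Up to 2 kg", "30x20x10 cm, 0.5kg", None, "1,5 KG parcel"])
weights = descriptions.apply(extract_weight_kg)
print(weights.tolist())
```

Returning NaN instead of 0 for missing descriptions would keep these rows out of later averages; that is a design choice worth revisiting.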

Let's move on to the next dataframe.

In [None]:
# The shipping labels that the shipping provider is charging us for.
provider_invoices_df

In [None]:
# Basic checks
transformer.check_if_unique_id(provider_invoices_df, 'tracking_code')
transformer.check_missing_values(provider_invoices_df)

In [None]:
# Finding out the values with duplicated tracking_code
ids = provider_invoices_df["tracking_code"]
provider_invoices_df_duplicated_tracking = provider_invoices_df[ids.isin(ids[ids.duplicated()])]
provider_invoices_df_duplicated_tracking

We can see that the tracking code is not unique and there are 4 rows affected. To me, this seems like a data quality issue. In order to join this dataframe with other dataframes, we can either drop the above data or keep the most recent record. I am choosing to drop it in the interest of faster analysis, but in a real-life scenario I would flag such an issue and keep hold of that data.

In [None]:
# Dropping duplicated values and checking whether tracking_code is unique now - this can be extracted in a separate function
duplicates = provider_invoices_df_duplicated_tracking.tracking_code.unique()
provider_invoices_df = provider_invoices_df[~provider_invoices_df.tracking_code.isin(duplicates)]

transformer.check_if_unique_id(provider_invoices_df, 'tracking_code')
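
As the comment in the cell suggests, the find-and-drop steps can be extracted into one function. The sketch below is one possible shape for it; `split_on_duplicate_key` is a hypothetical name and the sample data is made up. It returns the flagged rows as well, so they can be kept for later review rather than lost.

```python
import pandas as pd


def split_on_duplicate_key(df: pd.DataFrame, key: str):
    """Split df into (rows whose key is unique, rows whose key occurs more than once)."""
    # keep=False marks every occurrence of a repeated key, not just the later ones.
    is_duplicated = df[key].duplicated(keep=False)
    return df[~is_duplicated].copy(), df[is_duplicated].copy()


# Made-up sample where tracking code "B" appears twice.
invoices = pd.DataFrame(
    {"tracking_code": ["A", "B", "B", "C"], "amount": [1.0, 2.0, 2.5, 3.0]}
)
clean, flagged = split_on_duplicate_key(invoices, "tracking_code")
print(clean)
print(flagged)
```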

Another transformation we can do here is to convert the measured weight from grams to kilograms. Other tables have weight saved in kg, so this transformation will make future comparisons easier.

In [None]:
# Transforming provider_invoices_df weight to be in kg
transformed_provider_invoices_df = provider_invoices_df.copy()
transformed_provider_invoices_df["weight_measured"] = transformed_provider_invoices_df["weight_measured"] / 1000
transformed_provider_invoices_df

On to the next dataframe.

In [None]:
# The list of shipments that we see in our data
product_shipments_df

In [None]:
# Basic checks
transformer.check_if_unique_id(product_shipments_df, 'transaction_id')
transformer.check_if_unique_id(product_shipments_df, 'tracking_code')
transformer.check_missing_values(product_shipments_df)

In [None]:
# Finding out the values with duplicated tracking_code
ids = product_shipments_df["tracking_code"]
product_shipments_df_duplicated_tracking = product_shipments_df[ids.isin(ids[ids.duplicated()])]
product_shipments_df_duplicated_tracking

The same repeated tracking codes can be found in the product_shipments_df data and these codes are clearly associated with different shipments. We will get rid of them for now in order to be able to join the product_shipments and provider_invoices data.

In [None]:
# Dropping duplicated values and checking whether tracking_code is unique now - this can be extracted in a separate function
duplicates = product_shipments_df_duplicated_tracking.tracking_code.unique()
product_shipments_df = product_shipments_df[~product_shipments_df.tracking_code.isin(duplicates)]
transformer.check_if_unique_id(product_shipments_df, 'tracking_code')

I wonder whether the repeated tracking codes will be found in the product_invoices data - we can check that using the transaction IDs we saw next to those tracking codes in the product_shipments data.

In [None]:
# Transaction IDs that appeared next to the duplicated tracking codes in product_shipments_df
duplicates = [8638716, 271780177, 7789938, 260812237]

product_invoices_df.loc[product_invoices_df['transaction_id'].isin(duplicates)]

The transactions with the repeated tracking codes are present in the product_invoices data. We could drop them, but it won't matter: when we join this table with the others, these rows will be dropped anyway.

There is one more dataframe to check.

In [None]:
# The prices that we are being charged by the provider based on weight, route of the shipment.
provider_prices_df

This seems like reference data from the provider, which we can use to double-check whether the provider charges us fairly. We could remove the kg in actual_package_size, which would make comparisons easier. I will leave this task for later.
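
For when that task is picked up, here is a minimal sketch of stripping the unit. It assumes the values look like `0.5kg` or `1 kg`, which is a guess: the real format of actual_package_size is not shown here, and the sample series is made up.

```python
import pandas as pd

# Made-up values in the assumed "<number>kg" format.
sizes = pd.Series(["0.5kg", "1 kg", "2.0 KG", None])

# Pull out the leading number and convert it; anything unparsable becomes NaN.
size_kg = pd.to_numeric(
    sizes.str.extract(r"(\d+(?:\.\d+)?)", expand=False), errors="coerce"
)
print(size_kg)
```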

To make things easier - let's join product_shipments_df, product_invoices_df, and provider_invoices_df! This will allow us to have one main dataset, which will make explorations and comparisons easier. We know we can treat transaction_id as a unique key in product_invoices_df and product_shipments_df. After dropping the duplicates, tracking_code is also unique, and since it exists in both product_shipments_df and provider_invoices_df, we can easily perform the join.

I will first rename the columns that exist in more than one table so we don't get confused when the join happens.

In [None]:
renamed_product_shipments_df = product_shipments_df.rename(columns={'from_country':'from_country_product', 'to_country':'to_country_product'})
renamed_product_invoices_df = product_invoices_df.rename(columns={'amount':'amount_product'})
renamed_provider_invoices_df = transformed_provider_invoices_df.rename(columns={'from_country':'from_country_provider', 'to_country':'to_country_provider', 
                                     'amount':'amount_provider'})

product_invoices_shipments_df = pd.merge(renamed_product_shipments_df, renamed_product_invoices_df, on='transaction_id')
product_provider_invoices_shipments_df = pd.merge(product_invoices_shipments_df, renamed_provider_invoices_df, 
                                                  on='tracking_code')

full_df = pd.merge(product_package_types_df[['weight_reported', 'id']],product_provider_invoices_shipments_df,
                                                  left_on='id', right_on='package_type_id', how='right')

full_df

I am noticing that we have dropped some records when joining product_shipments_df, product_invoices_df, and provider_invoices_df. It would be useful to check whether there are provider invoices that do not correspond to any product invoices, but for the sake of the current task, I am assuming any such discrepancies might be because of incomplete data and I will focus on more exciting explorations.
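
One way to run that check is an outer merge with pandas' `indicator=True` option, which adds a `_merge` column saying whether each row was found on the left side, the right side, or both. The sketch below uses small made-up frames rather than the real data.

```python
import pandas as pd

# Made-up stand-ins for product_shipments_df and provider_invoices_df.
shipments = pd.DataFrame(
    {"tracking_code": ["A", "B", "C"], "transaction_id": [1, 2, 3]}
)
provider_invoices = pd.DataFrame(
    {"tracking_code": ["B", "C", "D"], "amount": [2.0, 3.0, 4.0]}
)

# The outer join keeps unmatched rows from both sides; _merge tells them apart.
audit = pd.merge(
    shipments, provider_invoices, on="tracking_code", how="outer", indicator=True
)
only_in_shipments = audit[audit["_merge"] == "left_only"]
only_in_provider = audit[audit["_merge"] == "right_only"]

print(audit["_merge"].value_counts())
```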

The next thing we can do is to check whether the duplicated columns in the tables we just joined match - if they do, we can drop the redundant copy.

In [None]:
# Checking whether duplicated columns from joined tables match. This can be more elegant.
isFromCountrySame = full_df['from_country_product'].equals(full_df['from_country_provider'])
if isFromCountrySame:
    full_df.drop('from_country_provider', axis=1, inplace=True)

isToCountrySame = full_df['to_country_product'].equals(full_df['to_country_provider'])
if isToCountrySame:
    full_df.drop('to_country_provider', axis=1, inplace=True)
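
A slightly more compact version of the same check could loop over column pairs. This is only a sketch: `drop_redundant_columns` is a hypothetical helper, and the sample frame is made up.

```python
import pandas as pd


def drop_redundant_columns(df: pd.DataFrame, column_pairs) -> pd.DataFrame:
    """Drop the second column of each (keep, duplicate) pair when both are identical."""
    redundant = [dup for keep, dup in column_pairs if df[keep].equals(df[dup])]
    return df.drop(columns=redundant)


# Made-up joined data: the from_country columns agree, the to_country columns do not.
joined = pd.DataFrame(
    {
        "from_country_product": ["FR", "DE"],
        "from_country_provider": ["FR", "DE"],
        "to_country_product": ["FR", "FR"],
        "to_country_provider": ["FR", "ES"],
    }
)
pairs = [
    ("from_country_product", "from_country_provider"),
    ("to_country_product", "to_country_provider"),
]
deduplicated = drop_redundant_columns(joined, pairs)
print(list(deduplicated.columns))
```

Returning a new dataframe instead of using `inplace=True` makes the cell safe to re-run.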

In [None]:
# Comparing amount columns
isAmtSame = full_df['amount_product'].equals(full_df['amount_provider'])
print(isAmtSame)

Now this is interesting - we can see that the amount in the product invoices differs from the amount in the provider invoices. We need to explore further. Something that might be useful is a new

# Exploring

Starting with a basic description of the numeric values in our new main table.

In [None]:
full_df.describe()

We can see that the amount in the product invoices is slightly higher on average than the amount in the provider invoices. The lower standard deviation of the amount in the product invoices suggests that this data is a bit less varied and tends to be closer to the mean than the amount in the provider invoices. But when is the discrepancy observed? Bring on the graphs - we are about to find out.

In [None]:
# Distribution graph for amount_provider
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(rc={'figure.figsize':(11.7,8.27)})
# Rounding to 2 decimals so that near-identical amounts fall into the same bar
plotAmtProvider = sns.countplot(x='amount_provider', data=full_df.round(2))
plt.xlabel('Amount')
plt.ylabel('Count')
plt.title('Distribution of amount in provider invoices')

In [None]:
# Distribution graph for amount_product
plotAmtProduct = sns.countplot(x="amount_product",data=full_df.round(2))
plt.xlabel('Amount')
plt.ylabel('Count')
plt.title('Distribution of amount in product invoices')

We can see that the majority of amounts for both product and provider invoices take one of three values. Additionally, as expected, the distributions differ and the product invoices include more invoices with higher amounts.

Now let's see how the amounts in both types of invoices have changed over time.

In [None]:
# Changing the type of some columns to aid further exploration
full_df[["shipping_label_created", "user_invoice_date"]] = full_df[["shipping_label_created", "user_invoice_date"]].apply(pd.to_datetime)
full_df["shipping_label_created_year_month"] = full_df["shipping_label_created"].dt.strftime('%Y-%m')
full_df['weight_reported'] = full_df['weight_reported'].astype(float)

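# Creating a dataframe with mean values per month
# (numeric_only=True skips text and date columns, which recent pandas versions refuse to average)
mean_full_df = full_df.groupby('shipping_label_created_year_month').mean(numeric_only=True)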

In [None]:
# Line plot - amount
sns.lineplot(x="shipping_label_created_year_month", y="amount_product", data=mean_full_df)
sns.lineplot(x="shipping_label_created_year_month", y="amount_provider", data=mean_full_df)
plt.xlabel('Year-Month')
plt.ylabel('Amount')
plt.xticks(rotation=90)
plt.legend(['Product', 'Provider'])
plt.title('Mean amount in product invoices versus mean amount in provider invoices shown per month')
plt.show()

First of all, ignore the INFO logs - ideally I would configure separate loggers so that INFO messages do not clutter the graphs, but I am short on time.
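
A possible way to quiet those messages is sketched below. It assumes the INFO lines come from matplotlib's own logger (matplotlib emits such messages, for example when strings are plotted as categories); if they come from elsewhere, the logger names need adjusting. The `pipeline` logger name is hypothetical.

```python
import logging

# Keep INFO output for our own checks (hypothetical logger name)...
logging.basicConfig(level=logging.INFO)
pipeline_logger = logging.getLogger("pipeline")
pipeline_logger.setLevel(logging.INFO)

# ...but only show warnings and errors from the plotting libraries.
for noisy_name in ("matplotlib", "PIL"):
    logging.getLogger(noisy_name).setLevel(logging.WARNING)

pipeline_logger.info("This message is still shown")
```

Child loggers such as `matplotlib.category` inherit the level from `matplotlib`, so one line per library is enough.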

We can see that the discrepancy between the product and provider amount has reduced with time and the most recent data has almost identical mean amounts.

Let's see the same relationship in regard to the mean weight customers selected versus the mean weight recorded by the provider. 

In [None]:
# Line plot - weight
sns.lineplot(x="shipping_label_created_year_month", y="weight_reported", data=mean_full_df)
sns.lineplot(x="shipping_label_created_year_month", y="weight_measured", data=mean_full_df)
plt.xlabel('Year-Month')
plt.ylabel('Weight in kg')
plt.xticks(rotation=90)
plt.legend(['Weight reported by customers', 'Weight recorded in provider invoices'])
plt.title('Mean weight measured by provider and mean weight reported by customers per month')
plt.show()

On average, it seems like the weight selected by sellers is higher than the weight measured by the provider. This discrepancy makes sense given that the weight measured by providers is the actual weight, while the weight provided by sellers is an estimate of the maximum weight their package could have.

The mean reported and measured weights are getting closer with time, which indicates that sellers' estimates are getting better - perhaps due to more options for package sizes. The average weight of packages also seems to be increasing - as Vinted was getting more popular, perhaps it was getting more common for sellers to ship bigger items or several items in the same package.

Now let's step away from mean values and explore counts.

In [None]:
# Creating a dataframe with unique values per month
unique_full_df = full_df.groupby('shipping_label_created_year_month').nunique()
unique_full_df

I am noticing that the number of unique buyers and the number of unique sellers are consistently similar, and both are increasing over time. Besides user growth, another growth measure is the number of transactions over time, which I will plot.

In [None]:
# Line plot - transactions
sns.lineplot(x="shipping_label_created_year_month", y="transaction_id", data=unique_full_df)
plt.xlabel('Year-Month')
plt.ylabel('Number of transactions')
plt.xticks(rotation=90)
plt.title('Number of transactions per month')
plt.show()

The number of transactions has been increasing steadily since 2017. The beginning of this growth seems to correlate with a drop in the mean shipping price charged by the provider. I am assuming a better deal was negotiated in June 2017 due to the increase in shipments.

I am curious whether the discrepancy between the product and provider invoices depends on the type of shipping - local vs international.

In [None]:
# Preparing data so we can see split by local and international shipping.
# A shipment is international (trans country) when origin and destination countries differ.
full_df['is_trans_country'] = (full_df['from_country_product'] != full_df['to_country_product']).astype(int)

mean_full_df_is_trans_false = full_df[full_df['is_trans_country'] == 0]
mean_full_df_is_trans_true = full_df[full_df['is_trans_country'] == 1]

mean_full_df_is_trans_false_mean = mean_full_df_is_trans_false.groupby('shipping_label_created_year_month').mean(numeric_only=True)
mean_full_df_is_trans_true_mean = mean_full_df_is_trans_true.groupby('shipping_label_created_year_month').mean(numeric_only=True)

# Line plot - amount for international shipments
sns.lineplot(x="shipping_label_created_year_month", y="amount_product", data=mean_full_df_is_trans_true_mean)
sns.lineplot(x="shipping_label_created_year_month", y="amount_provider", data=mean_full_df_is_trans_true_mean)
plt.xlabel('Year-Month')
plt.ylabel('Amount')
plt.xticks(rotation=90)
plt.legend(['Trans Country - Product', 'Trans Country - Provider'])
plt.title('Mean amount in product invoices versus mean amount in provider invoices for international shipments shown per month')
plt.show()

It looks like the data for international shipments covers only a small period but is still consistent with the trends we found so far.

In [None]:
# Line plot - amount for local shipments
sns.lineplot(x="shipping_label_created_year_month", y="amount_product", data=mean_full_df_is_trans_false_mean)
sns.lineplot(x="shipping_label_created_year_month", y="amount_provider", data=mean_full_df_is_trans_false_mean)

plt.xlabel('Year-Month')
plt.ylabel('Amount')
plt.xticks(rotation=90)
plt.legend(['In Country - Product', 'In Country - Provider'])
plt.title('Mean amount in product invoices versus mean amount in provider invoices for local shipments shown per month')
plt.show()

The graph of mean product versus provider invoice amounts for local shipments looks very similar to the graph we had for all shipments.

# Conclusion

### Summary of the findings

There is a discrepancy between the amounts in the provider’s invoices and the product’s invoices but this discrepancy seems to be reducing with time.

There is also a difference in the weights measured by providers and the weights reported by customers.

The data shows Vinted's clear growth. The beginning of the growth is associated with a decrease in the mean amount in the provider’s invoices. With an increase in shipments, Vinted could have negotiated lower prices with the shipping companies as it became a bigger customer.

### Limitations and further explorations

A lot of my graphs depended on means, which give only a glimpse of the data. I also didn't have the time to use the provider_prices data, which could have helped explore whether providers are overcharging Vinted.
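
To illustrate why means alone can mislead, the sketch below compares the mean with the median and the 90th percentile per month. The numbers are made up; only the column name `amount_provider` is borrowed from the notebook.

```python
import pandas as pd

# Made-up amounts: one expensive shipment in January pulls the mean up.
sample = pd.DataFrame(
    {
        "month": ["2017-01"] * 4 + ["2017-02"] * 4,
        "amount_provider": [3.0, 3.0, 3.0, 15.0, 4.0, 4.0, 4.0, 4.0],
    }
)

# Named aggregation: each keyword becomes a column in the result.
summary = sample.groupby("month")["amount_provider"].agg(
    mean="mean",
    median="median",
    p90=lambda s: s.quantile(0.9),
)
print(summary)
```

In this toy data the January mean is higher than the February mean even though the typical January shipment is cheaper, which the median makes visible.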

I could have integrated Tableau, which would have made the exploration faster.

Churn is another important metric, which could have been explored with this dataset - I could have looked at customer retention rate assuming that the buyer/seller IDs are unique to a buyer/seller and do not change.
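
A rough sketch of such a retention measure is shown below: for each month, the share of the previous month's sellers who were active again. The column name `seller_id` and the sample data are assumptions, and the sketch assumes every month is present in the data.

```python
import pandas as pd

# Made-up activity log: one row per shipment.
orders = pd.DataFrame(
    {
        "month": ["2017-01", "2017-01", "2017-02", "2017-02", "2017-03"],
        "seller_id": [1, 2, 2, 3, 3],
    }
)

# Set of active sellers per month, in chronological order.
active = orders.groupby("month")["seller_id"].apply(set).sort_index()

retention = {}
for previous_month, month in zip(active.index[:-1], active.index[1:]):
    returning = active[previous_month] & active[month]
    retention[month] = len(returning) / len(active[previous_month])

print(retention)
```

Churn for a month is then one minus the retention value.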

This type of data seems great for future predictive modelling with linear regression for example.

It would be useful to get information regarding the different changes Vinted has made, such as new deals with providers, new providers, or new package sizing systems.

### Design

The design of the current solution could be improved. My Jupyter notebook is used as the orchestrator of the application but that would not be the case in a production environment.

I would imagine the application to have batch processing, which would allow for monitoring of some key indicators daily and/or weekly. Reference data can be stored separately so we don’t load and clean it every time. As we load files, we need to keep track of the files that have already been loaded. Transformations and visualisations should be abstracted, which I could have done if I had more time. Caching can be used for faster processing. The loading process is not part of this task, but the data could be stored in a warehouse. Independent pipelines could also be separated, which would enable multiprocessing. Cloud solutions could provide an additional performance boost. Clear, uniform naming conventions and shared standards across all data pipelines would also benefit a growing system.
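
Keeping track of already loaded files could start as simply as a manifest file, as in the sketch below. The function names, the manifest format, and the directory layout are all hypothetical; a production pipeline would more likely record this in a database or in the warehouse itself.

```python
import json
import tempfile
from pathlib import Path


def read_manifest(manifest_path: Path) -> set:
    """Return the names of files recorded as loaded (empty if no manifest yet)."""
    if not manifest_path.exists():
        return set()
    return set(json.loads(manifest_path.read_text()))


def find_new_files(data_dir: Path, manifest_path: Path) -> list:
    """List JSON files in data_dir that are not in the manifest yet."""
    loaded = read_manifest(manifest_path)
    return sorted(p for p in data_dir.glob("*.json") if p.name not in loaded)


def mark_loaded(files, manifest_path: Path) -> None:
    """Record the given files as loaded."""
    loaded = read_manifest(manifest_path)
    loaded.update(p.name for p in files)
    manifest_path.write_text(json.dumps(sorted(loaded)))


# Demo in a temporary directory; the manifest lives outside the data folder.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    manifest = Path(tmp) / "loaded_files.json"

    (data_dir / "a.json").write_text("[]")
    new_files = find_new_files(data_dir, manifest)
    first_batch = [p.name for p in new_files]
    mark_loaded(new_files, manifest)

    (data_dir / "b.json").write_text("[]")
    second_batch = [p.name for p in find_new_files(data_dir, manifest)]

print(first_batch, second_batch)
```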
