# Data Scientist Take-home Challenge


This document lists the tasks we would like you to complete so that we can fully evaluate your technical abilities. If for some reason you cannot obtain all results in full, don’t worry. The purpose of this exercise is not to get exact numbers, but to understand how you would approach similar assignments, how you think, and how you would advise your client at the end.


You will work on three separate tasks. Each requires some data handling, modelling and visualization. We would like to receive the following outputs from you:

- All pseudo-code / code, written in the software of your choice. Please don’t forget to include notes and comments in the code.
- Numerical outputs of the final models and statistical calculations (if applicable for the specific task)
- Any visualizations of the data you consider useful to support your work (interactive charts will be considered an advantage)


## Task 1 - Basic Data Handling and Presentation

You are provided with a small subset of data in `data_task1.csv` on in-store purchases by a leading chain of supermarkets. You are required to provide your input on the following questions:

- First, please identify and visualize which store has the highest turnover within the available time period.

- Second, please identify which combination of 3 items appears most frequently in a single transaction (a transaction is identified by `bon_id_int`), and present the result in a sensible and convenient manner.

- Third, please compute whether buying the item with `d_global_item_id` = 115677 makes it more probable to also buy item 84872. Please explain your reasoning as well.
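As an illustration of the kind of computation the second and third questions call for (not a reference solution), here is a minimal sketch using only the Python standard library. It assumes the data have been reduced to `(bon_id_int, d_global_item_id)` pairs; the toy rows and the helper name `conditional_vs_baseline` are invented for this example. It counts 3-item combinations per transaction and compares P(84872 | 115677) with the baseline P(84872).

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy (bon_id_int, d_global_item_id) rows standing in for data_task1.csv.
# The values are invented for illustration only.
rows = [
    (1, 115677), (1, 84872), (1, 500),
    (2, 115677), (2, 84872), (2, 500), (2, 600),
    (3, 115677), (3, 700),
    (4, 84872), (4, 500),
    (5, 600), (5, 700),
]

# Collect the distinct items of each transaction.
baskets = defaultdict(set)
for bon_id, item_id in rows:
    baskets[bon_id].add(item_id)

# Count every unordered 3-item combination; sorting makes the tuples canonical.
triple_counts = Counter()
for items in baskets.values():
    triple_counts.update(combinations(sorted(items), 3))
top_triple, top_count = triple_counts.most_common(1)[0]


def conditional_vs_baseline(baskets, item_a, item_b):
    """Return P(b | a), P(b) and their ratio (the lift). Hypothetical helper."""
    n = len(baskets)
    n_a = sum(item_a in items for items in baskets.values())
    n_b = sum(item_b in items for items in baskets.values())
    n_ab = sum(item_a in items and item_b in items for items in baskets.values())
    if n == 0 or n_a == 0 or n_b == 0:
        raise ValueError("both items must occur in at least one transaction")
    p_b = n_b / n
    p_b_given_a = n_ab / n_a
    return p_b_given_a, p_b, p_b_given_a / p_b


p_b_given_a, p_b, lift = conditional_vs_baseline(baskets, 115677, 84872)
print(top_triple, top_count)
print(round(p_b_given_a, 3), round(p_b, 3), round(lift, 3))
```

A lift above 1 suggests that the two items co-occur more often than independence would predict. On real data the answer should also address whether the difference is statistically meaningful, for example with a test on the 2x2 contingency table. Note that the number of combinations grows cubically with basket size, so very large transactions may need special handling.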

## Task 2 - Data Wrangling

You are provided with extracts from two datasets in `data_task2_extract_1.csv` and `data_task2_extract_2.csv`, containing company information. 

The first dataset has standardized, publicly available information obtained via web scraping; the second dataset contains confidential internal company information.

In full, the datasets contain tens of thousands of observations, so manual processing is not feasible. The task at hand is to match the records between the two datasets as well as possible. The primary matching criterion is the address (identified by the fields **Address Name**, **City** and **PostCode**). However, address fields are often entered by humans, so conventions vary widely.

Please prepare an automated approach for canonicalization. Your code should handle discrepancies such as *Strasse* written in full or abbreviated (*Str.*). The algorithm need not handle all possible discrepancies, as this is hard to implement without the full data; a description of an iterative procedure is sufficient.
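To make the idea of canonicalization concrete, the following is a minimal sketch, assuming German-style addresses. The function names (`normalize_text`, `canonicalize_street`, `address_key`) and the rule set are hypothetical; in practice the rules would be extended iteratively by inspecting the records that remain unmatched.

```python
import re
import unicodedata

# Assumed transliteration convention: umlauts and sharp s are spelled out.
_TRANSLITERATE = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def normalize_text(value):
    """Lowercase, transliterate, strip accents and punctuation, collapse spaces."""
    text = str(value).strip().lower().translate(_TRANSLITERATE)
    # Remove any remaining diacritics (e.g. "é" -> "e").
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Punctuation becomes a space so that "Hauptstr.12" still splits correctly.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def canonicalize_street(address_name):
    """Map common spellings of a street name to one canonical form."""
    text = normalize_text(address_name)
    # "str" at the end of a word is assumed to abbreviate "strasse".
    text = re.sub(r"str\b", "strasse", text)
    # Join a separately written street type: "haupt strasse" -> "hauptstrasse".
    text = re.sub(r"\s+strasse\b", "strasse", text)
    return text


def address_key(address_name, city, post_code):
    """Build a comparable key from Address Name, City and PostCode."""
    post = re.sub(r"\s+", "", str(post_code))
    return (canonicalize_street(address_name), normalize_text(city), post)


print(canonicalize_street("Hauptstr. 12"))
print(address_key("Müllerstr. 3", "München", "80331 "))
```

Records from both datasets whose keys are equal can then be treated as address matches. Rules such as the `str` expansion are assumptions that can misfire on unusual names, so each new rule is worth checking against a sample of the data before it is kept.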

**Bonus** - One company might appear with different addresses. Imagine the company's building / offices are at an intersection and the company records contain a registration on each of the two intersecting streets, i.e. the addresses differ. Please formulate an approach to handle such cases. A sample implementation will earn more points.
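One possible approach to the bonus case, sketched below, is a two-pass match: first on the full address, then, for records still unmatched, on company name within the same city and postcode. This rests on two assumptions that the data may not satisfy: that both datasets carry a company name field (called `name` here, a hypothetical field name), and that the two intersecting streets share a postcode. All values are assumed to be canonicalized beforehand, and the sample records are invented.

```python
from collections import defaultdict


def match_records(left, right):
    """Return (left_index, right_index, rule) triples for matched records."""
    by_address = defaultdict(list)
    by_name = defaultdict(list)
    for j, rec in enumerate(right):
        by_address[(rec["address"], rec["city"], rec["post_code"])].append(j)
        by_name[(rec["name"], rec["city"], rec["post_code"])].append(j)

    matches = []
    for i, rec in enumerate(left):
        # Pass 1: the primary criterion, an exact match on the address key.
        hits = by_address.get((rec["address"], rec["city"], rec["post_code"]))
        rule = "address"
        if not hits:
            # Pass 2: fallback for the same company registered on another street.
            hits = by_name.get((rec["name"], rec["city"], rec["post_code"]), [])
            rule = "name+postcode"
        matches.extend((i, j, rule) for j in hits)
    return matches


# Invented sample records, already canonicalized.
left = [
    {"name": "acme gmbh", "address": "hauptstrasse 1", "city": "berlin", "post_code": "10115"},
    {"name": "beta ag", "address": "ringstrasse 9", "city": "berlin", "post_code": "10117"},
]
right = [
    {"name": "acme gmbh", "address": "nebenstrasse 2", "city": "berlin", "post_code": "10115"},
    {"name": "beta ag", "address": "ringstrasse 9", "city": "berlin", "post_code": "10117"},
]
matches = match_records(left, right)
print(matches)
```

Recording the rule that produced each match keeps the weaker fallback matches separable from the address matches, so they can be reviewed or scored differently.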

## Task 3 - Data Modelling

You are provided with data from a small-scale survey investigating the grocery purchasing behaviour and habits of Croatians. First, please review the enclosed PDF file. This is the survey questionnaire, which describes all key information on how the data were gathered and how each interview was conducted.

The task at hand is to cluster the respondents into relatively homogeneous groups. Start with all informative variables in the dataset. If you deem it necessary, you may create any number of additional derived variables or include publicly available data. If your clustering is based on a specific subset of variables, please clearly describe why and how you arrived at precisely this subset.

Write-up: Please describe, as clearly as possible, the procedure you applied, the results and the insights you would draw from the segmentation. Parameter selection must be justified. The write-up should not exceed one page, 1.5-spaced, Arial, 12 pt, standard margins.
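As one example of how a parameter choice could be justified, the sketch below selects the number of clusters by comparing mean silhouette scores across candidate values of k. It is a deliberately small pure-Python k-means on invented numeric points, not a recommended method for this survey: the real variables are likely categorical or ordinal and may call for a different distance measure or algorithm, and a tested library implementation would normally be used instead.

```python
import math
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                              # cohesion
        b = min(sum(d) / len(d) for d in dists.values())     # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Invented toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

Reporting the score for each candidate k, rather than only the chosen value, makes the justification easy to follow within the one-page limit.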