# Project Introduction - Customer Segmentation
The goal of this project is to analyze demographic data for customers of a mail-order sales company in Germany and compare it against demographic information for the general population, in order to identify which segments of the population the company should target in its next campaign.

The project is divided into __two main parts__:
1. In the first part, customers are segmented with an __unsupervised learning approach__ by comparing the customers' data against the general population's data. The goal is both to understand which groups are most interesting for the campaign and to select the most important features.
2. In the second part, the information gained is used to train a __supervised learning algorithm__ on a classification task: predicting which recipients are most likely to become customers of the mail-order company.

In [1]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# magic word for producing visualizations in notebook
%matplotlib inline

# 1. Load and Understand the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 features (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 features (columns).

Each row of the demographics files represents a single person, but it also includes information beyond the individual, such as information about their household, building, and neighborhood.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. Otherwise, the columns are the same across the four data files, except that the column counts listed above show one additional column in the "MAILOUT" training file (367 versus 366).
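
The column overlap described above can be checked directly once the files are loaded. The following is a minimal sketch, not part of the original notebook: the helper name `column_differences` and the toy frames (with placeholder feature names `FEATURE_A` and `FEATURE_B` and made-up values) are assumptions for illustration only.

```python
import pandas as pd


def column_differences(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Return the column names that appear in only one of the two frames."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
    }


# Toy stand-ins for the real files; feature names and values are hypothetical.
toy_population = pd.DataFrame({"FEATURE_A": [1, 2], "FEATURE_B": [3, 4]})
toy_customers = pd.DataFrame({
    "FEATURE_A": [1],
    "FEATURE_B": [2],
    "CUSTOMER_GROUP": ["A"],
    "ONLINE_PURCHASE": [0],
    "PRODUCT_GROUP": ["B"],
})

diff = column_differences(toy_customers, toy_population)
print(diff["only_left"])   # columns present only in the customers frame
print(diff["only_right"])  # columns present only in the population frame
```

Applied to the real data, the same call would be `column_differences(customers_df, population_df)`, using the data frames loaded later in this section.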

For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace:
- The file [DIAS Information Levels - Attributes 2017.xlsx](https://github.com/bruno-f7s/portfolio/blob/main/arvarto-customer-segmentation/data-dictionary/DIAS%20Information%20Levels%20-%20Attributes%202017.xlsx) is a top-level list of attributes and descriptions, organized by informational category.
- The file [DIAS Attributes - Values 2017.xlsx](https://github.com/bruno-f7s/portfolio/blob/main/arvarto-customer-segmentation/data-dictionary/DIAS%20Attributes%20-%20Values%202017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.

In [4]:
# load in the data
# forward slashes keep the paths portable across Windows, macOS and Linux
customers_df = pd.read_csv('data/Udacity_CUSTOMERS_052018.csv', sep=';', low_memory=False)
population_df = pd.read_csv('data/Udacity_AZDIAS_052018.csv', sep=';', low_memory=False)
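
Passing `low_memory=False` asks pandas to parse each file in a single pass rather than in internal chunks, which avoids a column being inferred with different types in different chunks. A column that mixes numeric codes with text placeholders would still load as strings, though. The sketch below is one possible way to spot such columns; the helper name `non_numeric_tokens` and the toy column names are hypothetical, and whether the real files contain such columns is not established here.

```python
import pandas as pd


def non_numeric_tokens(df: pd.DataFrame) -> dict:
    """Map partly numeric text columns to the values that fail numeric parsing."""
    report = {}
    # Look only at columns that pandas did not already load as numbers.
    for col in df.select_dtypes(exclude="number").columns:
        values = df[col].dropna()
        parsed = pd.to_numeric(values, errors="coerce")
        bad = values[parsed.isna()]
        # Keep columns where some, but not all, values are non-numeric.
        if 0 < len(bad) < len(values):
            report[col] = sorted(bad.astype(str).unique())
    return report


# Toy frame with hypothetical column names and made-up values.
toy = pd.DataFrame({
    "CODE_MIXED": ["1", "X", "3", None],   # numeric codes plus a placeholder
    "LABEL": ["a", "b", "c", None],        # purely textual, not reported
    "COUNT": [1, 2, 3, 4],                 # already numeric, skipped
})

report = non_numeric_tokens(toy)
print(report)
```

On the real data this could be run as `non_numeric_tokens(customers_df)`; any reported tokens would then need to be checked against the data dictionary before deciding how to encode them.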