### Dataset 1 - Country Flags
The dataset below describes each country and its flag through a set of attributes, with the goal of predicting the country's majority religion. We approach this as a clustering problem, in which flags are clustered into groups.
https://archive.ics.uci.edu/ml/datasets/Flags?fbclid=IwAR3cI_9sS9XxKBJ-RPXEIAPBOS3QDqkS7qYxicM6F_TiJB--5P8r1Tt6Lxk

#### Problem
For this problem we will cluster the flags based on their attributes, identify the characteristics each cluster shares, and report the connections we find between the flags, in this case the majority religion of each country.
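As a rough illustration of the last step, connecting clusters back to religion, the sketch below finds the most common label within each cluster. The helper `majority_label_per_cluster` and the toy arrays are hypothetical and are not part of this notebook; it assumes some clustering step has already produced one cluster id per flag.

```python
import numpy as np
from collections import Counter

def majority_label_per_cluster(cluster_ids, labels):
    """Map each cluster id to the most common label among its members."""
    result = {}
    for cid in np.unique(cluster_ids):
        members = labels[cluster_ids == cid]
        result[int(cid)] = Counter(members.tolist()).most_common(1)[0][0]
    return result

# Toy data, made up for illustration (not taken from the flags dataset)
toy_clusters = np.array([0, 0, 0, 1, 1, 1])
toy_religions = np.array(['Muslim', 'Muslim', 'Catholic',
                          'Catholic', 'Catholic', 'Hindu'])

majority = majority_label_per_cluster(toy_clusters, toy_religions)
print(majority)  # {0: 'Muslim', 1: 'Catholic'}
```

If a cluster's majority religion covers most of its members, that suggests the flag features carry some signal about religion; a near-even split suggests they do not.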

#### Extending the problem
It would be interesting to see which cluster our algorithm assigns to a new, made-up flag that we design based on our own human biases.
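One possible way to place a made-up flag is to assign it to the nearest cluster centre. This is only a sketch: it assumes a centroid-based method such as k-means, which the notebook has not committed to at this point, and the function `assign_to_cluster`, the centroids, and the feature values are all hypothetical.

```python
import numpy as np

def assign_to_cluster(new_point, centroids):
    """Return the index of the centroid closest to new_point (Euclidean distance)."""
    distances = np.linalg.norm(centroids - new_point, axis=1)
    return int(np.argmin(distances))

# Hypothetical centroids over three numeric flag features,
# e.g. (bars, stripes, colours); the values are made up.
toy_centroids = np.array([[0.0, 3.0, 3.0],
                          [3.0, 0.0, 2.0]])

# A made-up flag with no bars, five stripes, and four colours
made_up_flag = np.array([0.0, 5.0, 4.0])

cluster = assign_to_cluster(made_up_flag, toy_centroids)
print(cluster)  # 0
```

The new flag must be encoded with the same features, in the same order and scale, as the flags used to build the clusters.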

### Loading the Dataset
This dataset is stored in a single CSV file that contains all of the features. NumPy can easily load these values into a matrix so that we can use them in our analysis.

In [100]:
import numpy as np

filedata = np.genfromtxt('./data/CountryFlags/flag.data', dtype=None, delimiter=',', encoding='utf-8')

data = [[None for _ in range(len(filedata[0]))] for _ in range(len(filedata))]

# Data is stored as mostly integers, but these correspond to a string in the data description, stored in these lists 
landmass = [None, 'N.America', 'S.America', 'Europe', 'Africa', 'Asia', 'Oceania']
quadrant = [None, 'NE', 'SE', 'SW', 'NW']
languages = [None, 'English', 'Spanish', 'French', 'German', 'Slavic', 'Other Indo-European', 'Chinese', 'Arabic', 'Japanese/Turkish/Finnish/Magyar', 'Others']
religions = ['Catholic', 'Other Christian', 'Muslim', 'Buddhist', 'Hindu', 'Ethnic', 'Marxist', 'Others']
for i in range(len(data)):
    for j in range(len(data[i])):
        # Country name
        if j == 0:
            data[i][j] = str(filedata[i][j])
        # Landmass
        elif j == 1:
            data[i][j] = landmass[filedata[i][j]]
        # Quadrant
        elif j == 2:
            data[i][j] = quadrant[filedata[i][j]]
        # Language
        elif j == 5:
            data[i][j] = languages[filedata[i][j]]
        # Religion
        elif j == 6:
            data[i][j] = religions[filedata[i][j]]
        # All other features are kept as-is
        else:
            data[i][j] = filedata[i][j]
    # Make the row into a numpy array once it is fully populated;
    # dtype=object keeps the mixed string and numeric values as they are
    data[i] = np.array(data[i], dtype=object)

# Transpose so that features are along the rows and data points are along the columns
data = np.array(data).transpose()



['Muslim' 'Marxist' 'Muslim' 'Other Christian' 'Catholic' 'Ethnic'
 'Other Christian' 'Other Christian' 'Catholic' 'Catholic'
 'Other Christian' 'Catholic' 'Other Christian' 'Muslim' 'Muslim'
 'Other Christian' 'Catholic' 'Other Christian' 'Ethnic' 'Other Christian'
 'Buddhist' 'Catholic' 'Ethnic' 'Catholic' 'Other Christian' 'Muslim'
 'Marxist' 'Ethnic' 'Buddhist' 'Ethnic' 'Other Christian'
 'Other Christian' 'Catholic' 'Other Christian' 'Ethnic' 'Ethnic'
 'Catholic' 'Marxist' 'Catholic' 'Muslim' 'Ethnic' 'Other Christian'
 'Catholic' 'Marxist' 'Other Christian' 'Marxist' 'Other Christian'
 'Muslim' 'Other Christian' 'Catholic' 'Catholic' 'Muslim' 'Catholic'
 'Ethnic' 'Other Christian' 'Other Christian' 'Other Christian'
 'Other Christian' 'Other Christian' 'Catholic' 'Catholic' 'Catholic'
 'Ethnic' 'Ethnic' 'Marxist' 'Other Christian' 'Ethnic' 'Other Christian'
 'Other Christian' 'Other Christian' 'Other Christian' 'Other Christian'
 'Catholic' 'Muslim' 'Ethnic' 'Hindu' 'Catholic' 'C

### Cleaning the Dataset
The point of this problem is to use only the data and features that we can get from a given country's flag. This dataset includes features such as area and population that are not related to the flag and should be removed.
One of these, religion, will be the label that we aim to predict from the flag. Thus we will have to extract the religion feature as the label and eliminate the remaining non-flag features.
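The idea can be sketched on a small stand-in matrix that uses the same transposed layout as `data` (one row per feature, one column per country). This is an illustration, not the notebook's own cell: the toy values are made up, and treating rows 0 to 6 (name, landmass, quadrant, area, population, language, religion) as the non-flag features is an assumption based on the dataset description.

```python
import numpy as np

# Toy stand-in for the transposed matrix; all values are made up.
toy = np.array([
    ['Aland', 'Borduria', 'Cadia'],       # 0: name
    ['Asia', 'Europe', 'Africa'],         # 1: landmass
    ['NE', 'NE', 'SE'],                   # 2: quadrant
    [10, 20, 30],                         # 3: area
    [1, 2, 3],                            # 4: population
    ['Arabic', 'Slavic', 'Others'],       # 5: language
    ['Muslim', 'Catholic', 'Hindu'],      # 6: religion
    [0, 3, 0],                            # 7: bars
    [3, 0, 5],                            # 8: stripes
], dtype=object)

# Assumption: rows 0-6 are the features that do not describe the flag itself
NON_FLAG_ROWS = [0, 1, 2, 3, 4, 5, 6]

# Religion (row 6) becomes the label vector
toy_labels = toy[6]

# Drop the non-flag rows, keeping only the flag features
toy_features = np.delete(toy, NON_FLAG_ROWS, axis=0)

print(toy_labels)          # ['Muslim' 'Catholic' 'Hindu']
print(toy_features.shape)  # (2, 3)
```

Pulling out the label row before deleting matters, because religion is itself one of the rows being removed.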

In [53]:
# Extract the religions as the labels, row 6

labels = 