### Dataset 1 - Country Flags
The dataset below provides different information of a country in order to predict potentially what the country’s majority religion is. This falls more in line with a clustering problem, whereby flags are clustered into groups.
https://archive.ics.uci.edu/ml/datasets/Flags?fbclid=IwAR3cI_9sS9XxKBJ-RPXEIAPBOS3QDqkS7qYxicM6F_TiJB--5P8r1Tt6Lxk

#### Problem
For this problem we will find clusters of flags based on the data attributes, find the common characteristics, and output the connections that we find between the flags, in this case, the majority religion of the country

#### Extending the problem
It would be interesting to see what our algorithm would cluster a new-made up flag that we create based on our own human biases.

### Dataset Description

### Loading the Dataset
This dataset is stored on a single csv file across all of the features. Numpy can easily load in these values int oa matrix so that we can use it in our analysis.

In [1]:
import numpy as np

filedata = np.genfromtxt('./data/CountryFlags/flag.data', dtype=None, delimiter=',', encoding='utf-8')

data = [[None for _ in range(len(filedata[0]))] for _ in range(len(filedata))]

# Data is stored as mostly integers, but these correspond to a string in the data description, stored in these lists 
landmass = [None, 'N.America', 'S.America', 'Europe', 'Africa', 'Asia', 'Oceania']
quadrant = [None, 'NE', 'SE', 'SW', 'NW']
languages = [None, 'English', 'Spanish', 'French', 'German', 'Slavic', 'Other Indo-European', 'Chinese', 'Arabic', 'Japanese/Turkish/Finnish/Magyar', 'Others']
religions = ['Catholic', 'Other Christian', 'Muslim', 'Buddhist', 'Hindu', 'Ethnic', 'Marxist', 'Others']
for i in range(len(data)):
    for j in range(len(data[i])):
        # Country Name
        if (j == 0):
            data[i][j] = str(filedata[i][j])
        # Landmass
        elif (j == 1):
            data[i][j] = landmass[filedata[i][j]]
        elif (j  == 2):
            data[i][j] = quadrant[filedata[i][j]]
        elif (j  == 5):
            data[i][j] = languages[filedata[i][j]]
        elif (j == 6):
            data[i][j] = religions[filedata[i][j]]
        else:
            data[i][j] = filedata[i][j]
        # Make the row into a numpy array
        data[i] = np.array(data[i])

# Transpose so that features are along the rows and data points are along the columns
data = np.array(data).transpose()

print (data.shape)

(30, 194)


### Cleaning the Dataset
The point of this problem is to only use the data and features that we can get from a given countries flag. This dataset includes features such as population, density etc. that are not related to the flag, and should be removed.
One of these, religion, will be our label that we are aiming to predict based off of the flag. Thus we will have to extract the religion feature as a label, and eliminate the non-flag related features.

In [2]:
# Extract the religions as the labels, row 6
names = data[0]
labels = data[6] 
data = np.delete(data, 6, axis=0)

# Extract the non flag related data
print (data[0:6])
for i in range(6):
    data = np.delete(data, 0, axis=0)
    
print (data.shape)

[['Afghanistan' 'Albania' 'Algeria' ... 'Zaire' 'Zambia' 'Zimbabwe']
 ['Asia' 'Europe' 'Africa' ... 'Africa' 'Africa' 'Africa']
 ['NE' 'NE' 'NE' ... 'SE' 'SE' 'SE']
 [648 29 2388 ... 905 753 391]
 [16 3 20 ... 28 6 8]
 ['Others' 'Other Indo-European' 'Arabic' ... 'Others' 'Others' 'Others']]
(23, 194)


### Analysing our Cleaned Data

With our data cleaned and prepared for analysis, we can start our analysis. Our goal is to build a classifier taht based on these features about a country's flag, we can clasify that country's major religion. It may be interesting to see what would happen if we used another feature as a target, but for this project we will be solely be focused on a classifier that focuses on religion.
These are the 3 methods that will be used, along with the group member responsible for that method:
- Classification Method 1 - Brooks Tawil
- Naive Bayes - Jack Chiu
- Classification Method 3 - Gavin Mckim

# Naive Bayes

In [3]:
def uNiQuE(vec):
    popCtr = 0
    for p in vec:
        if p in vec[:popCtr]:
            vec = np.delete(vec, popCtr, 0)
            popCtr -= 1
        popCtr += 1
        
    return vec   

def kFold(data, labels, kFolds):
    #shuffle
    inds = np.random.choice(np.arange(len(data)), len(data))
    data[:] = data[inds]
    labels[:] = labels[inds]

    startInd = 0
    stepSize = int(len(data)/kFolds)
    Acc = []
    predictions = []
    for i in range(kFolds):
        if i != kFolds-1:
            testData = data[startInd:startInd+stepSize]
            testLabels = labels[startInd:startInd+stepSize]
            trainingData = data[:startInd]
            trainingData = np.concatenate((trainingData, data[startInd+stepSize:]))
            trainingLabels = labels[:startInd]
            trainingLabels = np.concatenate((trainingLabels,labels[startInd+stepSize:]))
        else:
            testData = data[startInd:]
            testLabels = labels[startInd:]
            trainingData = data[:startInd]
            trainingLabels = labels[:startInd]

        startInd += stepSize
        temp, pList = calcErrNaive(trainingData, trainingLabels, testData, testLabels)
        Acc.append(temp)
        predictions.extend(pList)
    
    return Acc, labels, predictions
        

def trainNaive(data, labels):
    unique, counts = np.unique(labels, return_counts=True)
    prior = counts
    prior = (prior+0.0)/len(data)
    
    conditional = np.zeros((8, len(data[0]), 50))
    #conditional = (labels, feature, values in feature)
    for i in range(len(data)):
        for j in range(len(data[i])):
            conditional[labels[i], j, data[i,j]] += 1
    for i in range(len(conditional)):
        for j in range(len(conditional[0])):
            sumCondition = 0
            for k in range(0,conditional.shape[0]):
                sumCondition += sum(conditional[k,j,:])
            for k in range(0,len(conditional[0,j])):
                conditional[i,j,k] = conditional[i,j,k]/sumCondition
    

    return prior, conditional, unique


def testNaive(prior, conditional, unique, sample):
    prob = prior
    for i in range(len(sample)):
        for j in range(len(prob)):
            prob[j] = prob[j] * conditional[j,i,sample[i]]
    maxVal = np.argmax(prob)
        
    return unique[maxVal]
        
def calcErrNaive(trainingData, trainingLabels, testData, testLabels):
    errs = 0
    prior, conditional, unique = trainNaive(trainingData, trainingLabels)
    pList = []
    for i in range(len(testData)):
        prediction = testNaive(prior, conditional, unique, testData[i])
        errs += int(prediction == testLabels[i])
        pList.append(prediction)
        
    return np.round(errs/len(testLabels), 6), pList


def preProcess(data):
    i = 0
    for q in range(len(data)):
        if None in data[i]:
            data = np.delete(data, i, 0)
            i -= 1        
        i += 1
        for j in range(len(data[i-1])):
            if type(data[i-1,j]) == str:
                data[i-1,j] = data[i-1,j].lower()

    domColor = uNiQuE(data[:,10])
    topLeftColor = uNiQuE(data[:,-2])
    botRightColor = uNiQuE(data[:,-1])
    numStars = np.array([6, 5, 4, 3, 2, 1, 0])
    for i in range(len(data)):
        tempInd = np.where(domColor == data[i,10])
        data[i,10] = int(tempInd[0])
        tempInd = np.where(topLeftColor == data[i,-2])
        data[i,-2] = int(tempInd[0])
        tempInd = np.where(botRightColor == data[i,-1])
        data[i,-1] = int(tempInd[0])
        
        #make stars be in range [0 to >5]
        tempInd = np.where(numStars <= data[i,15])
        data[i,15] = int(tempInd[0][0])
    
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data[i,j] = int(data[i,j])
    return data

tempData = preProcess(data.T)
religions = uNiQuE(labels)
tempLabels = labels[:]
for i in range(len(labels)):
    tempInd = np.where(religions == labels[i])
    tempLabels[i] = int(tempInd[0])
    
for i in np.arange(3,13,2):
    acc, actual, predictions = kFold(tempData, tempLabels, i)
    print('k = ', i, ' with average accuracy = ' , np.average(acc).round(6))
    print('Accuracy for each fold: ', acc)

k =  3  with average accuracy =  0.185448
Accuracy for each fold:  [0.1875, 0.171875, 0.19697]
k =  5  with average accuracy =  0.185464
Accuracy for each fold:  [0.210526, 0.105263, 0.236842, 0.184211, 0.190476]
k =  7  with average accuracy =  0.209491
Accuracy for each fold:  [0.222222, 0.148148, 0.111111, 0.222222, 0.222222, 0.259259, 0.28125]
k =  9  with average accuracy =  0.234025
Accuracy for each fold:  [0.142857, 0.095238, 0.333333, 0.190476, 0.190476, 0.428571, 0.380952, 0.190476, 0.153846]
k =  11  with average accuracy =  0.261141
Accuracy for each fold:  [0.411765, 0.235294, 0.176471, 0.294118, 0.117647, 0.117647, 0.294118, 0.294118, 0.294118, 0.470588, 0.166667]


The Naive Bayes was created to classify from the data what type of religion the country with that flag would have. Evidently, from running the classifier with k-fold cross validation over various k values, the accuracy of the classifier was pretty low $\leq 40\%$. It can be said that there is a pretty weak relation between the country's flag features/data and the majority religion of that country based on the Naive Bayes Classifier.

In [4]:
acc, actual, predictions = kFold(tempData, tempLabels, 11)
print('Actual Religions and Predicted Religion for first 20 samples: ')
for i in range(20):
    if actual[i] == 2:
        print('Actual: ', religions[actual[i]], '\t Predicted:', religions[predictions[i]])
    elif actual[i] == 6:
        print('Actual: ', religions[actual[i]], '\t \t \t Predicted:', religions[predictions[i]])
    else:
        print('Actual: ', religions[actual[i]], '\t \t Predicted:', religions[predictions[i]])

Actual Religions and Predicted Religion for first 20 samples: 
Actual:  Other Christian 	 Predicted: Other Christian
Actual:  Other Christian 	 Predicted: Other Christian
Actual:  Marxist 	 	 Predicted: Muslim
Actual:  Ethnic 	 	 Predicted: Muslim
Actual:  Other Christian 	 Predicted: Muslim
Actual:  Marxist 	 	 Predicted: Muslim
Actual:  Muslim 	 	 Predicted: Muslim
Actual:  Marxist 	 	 Predicted: Muslim
Actual:  Catholic 	 	 Predicted: Muslim
Actual:  Other Christian 	 Predicted: Muslim
Actual:  Muslim 	 	 Predicted: Muslim
Actual:  Catholic 	 	 Predicted: Muslim
Actual:  Other Christian 	 Predicted: Muslim
Actual:  Other Christian 	 Predicted: Muslim
Actual:  Ethnic 	 	 Predicted: Muslim
Actual:  Catholic 	 	 Predicted: Muslim
Actual:  Other Christian 	 Predicted: Muslim
Actual:  Other Christian 	 Predicted: Other Christian
Actual:  Catholic 	 	 Predicted: Muslim
Actual:  Ethnic 	 	 Predicted: Muslim


Based on the Naive Bayes Classifier, the classifier tends to classify the flags as majority Muslim religion. Thus, based on Naive Bayes, there seems to not be much relation between a flag's image and its country's religion.