### Dataset 1 - Country Flags
The dataset below provides different information of a country in order to predict potentially what the majority religion is. This falls more in line with a clustering problem, whereby flags are clustered into groups.
https://archive.ics.uci.edu/ml/datasets/Flags?fbclid=IwAR3cI_9sS9XxKBJ-RPXEIAPBOS3QDqkS7qYxicM6F_TiJB--5P8r1Tt6Lxk

#### Problem
For this problem we contruct a classifier that will take in various attributes about a country's flag, an attempt to classify the majority religion of the country.

#### Extending the problem
With mre time and knowledge we could use more advanced ML algorithms to analyze this dataset. But in particiular, a more extensive knowledge in feature engineering would allow us to create our own features. The dataset is not insanly large, ~ 200 countries, so it is entirely within the realm of possibility to make our own features if we had more time and experience.

### Dataset Description

Source Information
   -- Creators: Collected primarily from the "Collins Gem Guide to Flags":
      Collins Publishers (1986).
   -- Donor: Richard S. Forsyth 
             8 Grosvenor Avenue
             Mapperley Park
             Nottingham NG3 5DX
             0602-621676
   -- Date: 5/15/1990
   
Number of Instances: 194

Number of attributes: 30 (overall)

Attribute Information:
   1. name	Name of the country concerned
   2. landmass	1=N.America, 2=S.America, 3=Europe, 4=Africa, 4=Asia, 6=Oceania
   3. zone	Geographic quadrant, based on Greenwich and the Equator 1=NE, 2=SE, 3=SW, 4=NW
   4. area	in thousands of square km
   5. population in round millions
   6. language 1=English, 2=Spanish, 3=French, 4=German, 5=Slavic, 6=Other Indo-European, 7=Chinese, 8=Arabic, 9=Japanese/Turkish/Finnish/Magyar, 10=Others
   7. religion 0=Catholic, 1=Other Christian, 2=Muslim, 3=Buddhist, 4=Hindu, 5=Ethnic, 6=Marxist, 7=Others
   8. bars     Number of vertical bars in the flag
   9. stripes  Number of horizontal stripes in the flag
  10. colours  Number of different colours in the flag
  11. red      0 if red absent, 1 if red present in the flag
  12. green    same for green
  13. blue     same for blue
  14. gold     same for gold (also yellow)
  15. white    same for white
  16. black    same for black
  17. orange   same for orange (also brown)
  18. mainhue  predominant colour in the flag (tie-breaks decided by taking the topmost hue, if that fails then the most central hue,and if that fails the leftmost hue)
  19. circles  Number of circles in the flag
  20. crosses  Number of (upright) crosses
  21. saltires Number of diagonal crosses
  22. quarters Number of quartered sections
  23. sunstars Number of sun or star symbols
  24. crescent 1 if a crescent moon symbol present, else 0
  25. triangle 1 if any triangles present, 0 otherwise
  26. icon     1 if an inanimate image present (e.g., a boat), otherwise 0
  27. animate  1 if an animate image (e.g., an eagle, a tree, a human hand) present, 0 otherwise
  28. text     1 if any letters or writing on the flag (e.g., a motto or slogan), 0 otherwise
  29. topleft  colour in the top-left corner (moving right to decide tie-breaks)
  30. botright Colour in the bottom-left corner (moving left to decide tie-breaks)

### Loading the Dataset
This dataset is stored on a single csv file across all of the features. Numpy can easily load in these values int oa matrix so that we can use it in our analysis.

In [7]:
import numpy as np
import operator

filedata = np.genfromtxt('./data/CountryFlags/flag.data', dtype=None, delimiter=',', encoding='utf-8')

data = [[None for _ in range(len(filedata[0]))] for _ in range(len(filedata))]

# Data is stored as mostly integers, but these correspond to a string in the data description, stored in these lists 
landmass = [None, 'N.America', 'S.America', 'Europe', 'Africa', 'Asia', 'Oceania']
quadrant = [None, 'NE', 'SE', 'SW', 'NW']
languages = [None, 'English', 'Spanish', 'French', 'German', 'Slavic', 'Other Indo-European', 'Chinese', 'Arabic', 'Japanese/Turkish/Finnish/Magyar', 'Others']
religions = ['Catholic', 'Other Christian', 'Muslim', 'Buddhist', 'Hindu', 'Ethnic', 'Marxist', 'Others']
for i in range(len(data)):
    for j in range(len(data[i])):
        # Country Name
        if (j == 0):
            data[i][j] = str(filedata[i][j])
        # Landmass
        elif (j == 1):
            data[i][j] = landmass[filedata[i][j]]
        elif (j  == 2):
            data[i][j] = quadrant[filedata[i][j]]
        elif (j  == 5):
            data[i][j] = languages[filedata[i][j]]
        elif (j == 6):
            data[i][j] = religions[filedata[i][j]]
        else:
            data[i][j] = filedata[i][j]
        # Make the row into a numpy array
        data[i] = np.array(data[i])

# Transpose so that features are along the rows and data points are along the columns
data = np.array(data).transpose()

print (data.shape)

(30, 194)


### Cleaning the Dataset
The point of this problem is to only use the data and features that we can get from a given countries flag. This dataset includes features such as population, density etc. that are not related to the flag, and should be removed.
One of these, religion, will be our label that we are aiming to predict based off of the flag. Thus we will have to extract the religion feature as a label, and eliminate the non-flag related features.

In [8]:
# Extract the religions as the labels, row 6
names = data[0]
labels = data[6] 
data = np.delete(data, 6, axis=0)

# Extract the non flag related data
print (data[0:6])
for i in range(6):
    data = np.delete(data, 0, axis=0)

[['Afghanistan' 'Albania' 'Algeria' ... 'Zaire' 'Zambia' 'Zimbabwe']
 ['Asia' 'Europe' 'Africa' ... 'Africa' 'Africa' 'Africa']
 ['NE' 'NE' 'NE' ... 'SE' 'SE' 'SE']
 [648 29 2388 ... 905 753 391]
 [16 3 20 ... 28 6 8]
 ['Others' 'Other Indo-European' 'Arabic' ... 'Others' 'Others' 'Others']]


### Analysing our Cleaned Data

With our data cleaned and prepared for analysis, we can start our analysis. Our goal is to build a classifier that based on these features about a country's flag, we can classify that country's major religion. It may be interesting to see what would happen if we used another feature as a target, but for this project we will be solely be focused on a classifier that focuses on religion.
These are the 3 methods that will be used, along with the group member responsible for that method:
- K-Nearest Neighbors - Brooks Tawil
- Naive Bayes - Jack Chiu
- Kernal SVM - Gavin Mckim

## K-Nearest Neighbors - Brooks Tawil

In [11]:
import math

def uNiQuE(vec):
    popCtr = 0
    for p in vec:
        if p in vec[:popCtr]:
            vec = np.delete(vec, popCtr, 0)
            popCtr -= 1
        popCtr += 1
        
    return vec 

def preProcess(data):
    i = 0
    for q in range(len(data)):
        if None in data[i]:
            data = np.delete(data, i, 0)
            i -= 1        
        i += 1
        for j in range(len(data[i-1])):
            if type(data[i-1,j]) == str:
                data[i-1,j] = data[i-1,j].lower()

    domColor = uNiQuE(data[:,10])
    topLeftColor = uNiQuE(data[:,-2])
    botRightColor = uNiQuE(data[:,-1])
    numStars = np.array([6, 5, 4, 3, 2, 1, 0])
    for i in range(len(data)):
        tempInd = np.where(domColor == data[i,10])
        data[i,10] = int(tempInd[0])
        tempInd = np.where(topLeftColor == data[i,-2])
        data[i,-2] = int(tempInd[0])
        tempInd = np.where(botRightColor == data[i,-1])
        data[i,-1] = int(tempInd[0])
        
        #make stars be in range [0 to >5]
        tempInd = np.where(numStars <= data[i,15])
        data[i,15] = int(tempInd[0][0])
    
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data[i,j] = int(data[i,j])
    return data

cleanData = preProcess(data.T)
religions = uNiQuE(labels)

print (cleanData.shape)
print (cleanData)
print (religions)

(194, 23)
[[0 3 5 ... 0 0 0]
 [0 0 3 ... 0 1 1]
 [2 0 3 ... 0 2 2]
 ...
 [0 0 4 ... 0 2 0]
 [3 0 4 ... 0 2 7]
 [0 7 5 ... 0 2 0]]
['Muslim' 'Marxist' 'Other Christian' 'Catholic' 'Ethnic' 'Buddhist'
 'Hindu' 'Others']


We will use the Euclidean Distance as the distance metric. other distance metrics, such as Manhattan Distance, do exist and could also be used.

In [12]:
def euclideanDistance(trainingInstance, testInstance, length):
    distance = 0.0    
    for x in range(length):
        distance += np.square(trainingInstance[x] - testInstance[x])
    
    return math.sqrt(distance)

# Defining our KNN model
def knn(trainingSet, testInstance, k):
    # The training set will have the labels appended to the end, so that sorting can be done with the labels attached
    distances = {}
    sort = {}
 
    length = len(testInstance)
    
    # Calculating euclidean distance between each row of training data and test data
    for x in range(len(trainingSet)):
        distances[x] = euclideanDistance(trainingSet[x], testInstance.T, length)
          
    # Sorting them on the basis of distance
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
 
    # Extracting top k neighbors
    neighbors = []
    for x in range(k):
        neighbors.append(sorted_d[x])
    classVotes = {}
    
    # Calculating the most freq class in the neighbors
    for x in range(len(neighbors)):
        response = trainingSet[neighbors[x][0]][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    
    # Return as tuple (Class, List Of Neighbors)
    return(sortedVotes[0][0], neighbors)

### Leave-One-Out Cross Validation

For this dataset, we don't have that many country flags to use. We only have 194 country flags to work with! In addition, the use of kNN means that we especially want to have a large training set, so that our distances between our training and test instances are not that far from the expected reality, with holes in our training. In addition, our dataset is relatively small, so with a modern system the computation will not take long.

In [15]:
# Run KNN with Leave-One-Out Cross Validation
kRange = range(1, 12)
missClasses = [0 for _ in range(len(kRange))]
correctClasses = [0 for _ in range(len(kRange))]
accuracies = [0 for _ in range(len(kRange))]

# Leave-One-Out CV, changing the k in kNN
for k in kRange:
    for i in range(len(cleanData)):
        # Remake our dataWithLabels
        dataWithLabels = list(cleanData.T)
        np.array(dataWithLabels.append(labels.T))
        dataWithLabels = np.array(dataWithLabels).T
        
        # Assign a single test instance
        testInstance = cleanData[i-1]
        testInstanceLabel = labels[i-1]

        # We will only delete from dataWithLabels
        dataWithLabels = np.delete(dataWithLabels, i-1, axis=0)
        
        # Run knn
        assignment, neighbors = knn(dataWithLabels, testInstance, k)
        
        # Check if the assignemnt is correct
        if (assignment == testInstanceLabel):
            correctClasses[k-1] += 1
        else:
            missClasses[k-1] += 1

print (missClasses)
print (correctClasses)

# Compute the accuracies for each fold and report
for i in range(len(accuracies)):
    accuracies[i] = correctClasses[i]/float(correctClasses[i] + missClasses[i])
    
print (accuracies)
best_k = list(kRange)[np.argmax(accuracies)]

print ('Best K: ', best_k)

[136, 136, 120, 115, 112, 114, 110, 114, 114, 115, 114]
[58, 58, 74, 79, 82, 80, 84, 80, 80, 79, 80]
[0.29896907216494845, 0.29896907216494845, 0.38144329896907214, 0.4072164948453608, 0.422680412371134, 0.41237113402061853, 0.4329896907216495, 0.41237113402061853, 0.41237113402061853, 0.4072164948453608, 0.41237113402061853]
Best K:  7


In [37]:
# Example for a single run with  as the single testing point 
print ('Example run for ', names[46])

# Remake our dataWithLabels
dataWithLabels = list(cleanData.T)
np.array(dataWithLabels.append(labels.T))
dataWithLabels = np.array(dataWithLabels).T

# Assign a single test instance
testInstance = cleanData[46]
testInstanceLabel = labels[46]

print ('Denmark has a majority religion of: ', testInstanceLabel)

# We will only delete from dataWithLabels
dataWithLabels = np.delete(dataWithLabels, i-1, axis=0)

# Run knn
assignment, neighbors = knn(dataWithLabels, testInstance, best_k)

# Check if the assignemnt is correct
if (assignment == testInstanceLabel):
    print ('Denmark has been successfully classified as: ', testInstanceLabel)
else:
    print ('Denmark has been unsuccessfully classified as: ', assignment, ' when it is actually: ', testInstanceLabel)
    
print (neighbors)
print ('Similar countries are:')
# Print the neighbors
for i in range(len(neighbors)):
    print (names[neighbors[i][0]])

Example run for  Denmark
Denmark has a majority religion of:  Other Christian
Denmark has been successfully classified as:  Other Christian
[(45, 0.0), (165, 0.0), (129, 1.4142135623730951), (174, 1.7320508075688772), (91, 2.0), (105, 2.0), (116, 2.0)]
Similar countries are:
Czechoslovakia
Sweden
North-Yemen
Tunisia
Jordan
Malaysia
Montserrat


### Conclusions

Ultimately this classifier does not reach an accuracy beyond 45%. Using the KNN algorithm we were expecting better results. Howvever, when looking at a run against a single country, and then looking at samples of flags, its no surprise that the accuracy is that low for this particiular algorithm. Countries often share characteristics like color across a wide range of symbols and meaning, not necessarily religious. There does not appear to be a string linke between most country's flags and the majority religion. In addition, there is room for confusion caused differentiating countries across certain religious lines. For example there are a lot of similarities between countries labeled as 'Catholic' and 'Other Christian' but an incorrect classification may still be reached.

## Naive Bayes - Jack Chiu

In [9]:
def uNiQuE(vec):
    popCtr = 0
    for p in vec:
        if p in vec[:popCtr]:
            vec = np.delete(vec, popCtr, 0)
            popCtr -= 1
        popCtr += 1
        
    return vec   

def kFold(data, labels, kFolds):
    #shuffle
    inds = np.random.choice(np.arange(len(data)), len(data))
    data[:] = data[inds]
    labels[:] = labels[inds]

    startInd = 0
    stepSize = int(len(data)/kFolds)
    Acc = []
    predictions = []
    for i in range(kFolds):
        if i != kFolds-1:
            testData = data[startInd:startInd+stepSize]
            testLabels = labels[startInd:startInd+stepSize]
            trainingData = data[:startInd]
            trainingData = np.concatenate((trainingData, data[startInd+stepSize:]))
            trainingLabels = labels[:startInd]
            trainingLabels = np.concatenate((trainingLabels,labels[startInd+stepSize:]))
        else:
            testData = data[startInd:]
            testLabels = labels[startInd:]
            trainingData = data[:startInd]
            trainingLabels = labels[:startInd]

        startInd += stepSize
        temp, pList = calcErrNaive(trainingData, trainingLabels, testData, testLabels)
        Acc.append(temp)
        predictions.extend(pList)
    
    return Acc, labels, predictions
        

def trainNaive(data, labels):
    unique, counts = np.unique(labels, return_counts=True)
    prior = counts
    prior = (prior+0.0)/len(data)
    
    conditional = np.zeros((8, len(data[0]), 50))
    #conditional = (labels, feature, values in feature)
    for i in range(len(data)):
        for j in range(len(data[i])):
            conditional[labels[i], j, data[i,j]] += 1
    for i in range(len(conditional)):
        for j in range(len(conditional[0])):
            sumCondition = 0
            for k in range(0,conditional.shape[0]):
                sumCondition += sum(conditional[k,j,:])
            for k in range(0,len(conditional[0,j])):
                conditional[i,j,k] = conditional[i,j,k]/sumCondition
    

    return prior, conditional, unique


def testNaive(prior, conditional, unique, sample):
    prob = prior
    for i in range(len(sample)):
        for j in range(len(prob)):
            prob[j] = prob[j] * conditional[j,i,sample[i]]
    maxVal = np.argmax(prob)
        
    return unique[maxVal]
        
def calcErrNaive(trainingData, trainingLabels, testData, testLabels):
    errs = 0
    prior, conditional, unique = trainNaive(trainingData, trainingLabels)
    pList = []
    for i in range(len(testData)):
        prediction = testNaive(prior, conditional, unique, testData[i])
        errs += int(prediction == testLabels[i])
        pList.append(prediction)
        
    return np.round(errs/len(testLabels), 6), pList


def preProcess(data):
    i = 0
    for q in range(len(data)):
        if None in data[i]:
            data = np.delete(data, i, 0)
            i -= 1        
        i += 1
        for j in range(len(data[i-1])):
            if type(data[i-1,j]) == str:
                data[i-1,j] = data[i-1,j].lower()

    domColor = uNiQuE(data[:,10])
    topLeftColor = uNiQuE(data[:,-2])
    botRightColor = uNiQuE(data[:,-1])
    numStars = np.array([6, 5, 4, 3, 2, 1, 0])
    for i in range(len(data)):
        tempInd = np.where(domColor == data[i,10])
        data[i,10] = int(tempInd[0])
        tempInd = np.where(topLeftColor == data[i,-2])
        data[i,-2] = int(tempInd[0])
        tempInd = np.where(botRightColor == data[i,-1])
        data[i,-1] = int(tempInd[0])
        
        #make stars be in range [0 to >5]
        tempInd = np.where(numStars <= data[i,15])
        data[i,15] = int(tempInd[0][0])
    
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data[i,j] = int(data[i,j])
    return data

tempData = preProcess(data.T)
religions = uNiQuE(labels)
tempLabels = labels[:]
for i in range(len(labels)):
    tempInd = np.where(religions == labels[i])
    tempLabels[i] = int(tempInd[0])
    
for i in np.arange(3,13,2):
    acc, actual, predictions = kFold(tempData, tempLabels, i)
    print('k = ', i, ' with average accuracy = ' , np.average(acc).round(6))
    print('Accuracy for each fold: ', acc)

k =  3  with average accuracy =  0.201389
Accuracy for each fold:  [0.203125, 0.234375, 0.166667]
k =  5  with average accuracy =  0.275689
Accuracy for each fold:  [0.184211, 0.210526, 0.263158, 0.315789, 0.404762]
k =  7  with average accuracy =  0.304729
Accuracy for each fold:  [0.185185, 0.37037, 0.407407, 0.222222, 0.407407, 0.259259, 0.28125]
k =  9  with average accuracy =  0.275336
Accuracy for each fold:  [0.190476, 0.333333, 0.333333, 0.380952, 0.142857, 0.142857, 0.428571, 0.333333, 0.192308]
k =  11  with average accuracy =  0.36943
Accuracy for each fold:  [0.411765, 0.470588, 0.235294, 0.294118, 0.294118, 0.529412, 0.352941, 0.294118, 0.588235, 0.176471, 0.416667]


The Naive Bayes was created to classify from the data what type of religion the country with that flag would have. Evidently, from running the classifier with k-fold cross validation over various k values, the accuracy of the classifier was pretty low $\leq 40\%$. It can be said that there is a pretty weak relation between the country's flag features/data and the majority religion of that country based on the Naive Bayes Classifier.

In [10]:
acc, actual, predictions = kFold(tempData, tempLabels, 11)
print('Actual Religions and Predicted Religion for first 20 samples: ')
for i in range(20):
    if actual[i] == 2:
        print('Actual: ', religions[actual[i]], '\t Predicted:', religions[predictions[i]])
    elif actual[i] == 6:
        print('Actual: ', religions[actual[i]], '\t \t \t Predicted:', religions[predictions[i]])
    else:
        print('Actual: ', religions[actual[i]], '\t \t Predicted:', religions[predictions[i]])

Actual Religions and Predicted Religion for first 20 samples: 
Actual:  Buddhist 	 	 Predicted: Catholic
Actual:  Muslim 	 	 Predicted: Catholic
Actual:  Ethnic 	 	 Predicted: Muslim
Actual:  Muslim 	 	 Predicted: Muslim
Actual:  Other Christian 	 Predicted: Muslim
Actual:  Muslim 	 	 Predicted: Muslim
Actual:  Muslim 	 	 Predicted: Muslim
Actual:  Other Christian 	 Predicted: Muslim
Actual:  Catholic 	 	 Predicted: Muslim
Actual:  Muslim 	 	 Predicted: Muslim
Actual:  Muslim 	 	 Predicted: Muslim
Actual:  Buddhist 	 	 Predicted: Muslim
Actual:  Muslim 	 	 Predicted: Muslim
Actual:  Muslim 	 	 Predicted: Muslim
Actual:  Other Christian 	 Predicted: Muslim
Actual:  Buddhist 	 	 Predicted: Muslim
Actual:  Other Christian 	 Predicted: Muslim
Actual:  Muslim 	 	 Predicted: Muslim
Actual:  Catholic 	 	 Predicted: Other Christian
Actual:  Ethnic 	 	 Predicted: Muslim


Based on the Naive Bayes Classifier, the classifier tends to classify the flags as majority Muslim religion. Thus, based on Naive Bayes, there seems to not be much relation between a flag's image and its country's religion.

# Kernal SVM - Gavin McKim

In [117]:
from sklearn import svm
sdata = data[0:21,:]
sdata = np.delete(sdata, 10, 0)
#Following the preprocessing done by Brooks, I did a little of my own preprocessing. I just got rid of all the features
# that don't have numerical values.

#print(np.shape(sdata))
#print(sdata)
shfl = np.vstack((sdata,np.reshape(labels,[1,194])))
shfl = np.random.permutation(shfl.T).T
sdata = shfl[0:20,:]
labels = shfl[20,:]

#print(shfl)


#print(sdata[8,:])
G = np.arange(.01,1,0.01)

#for c in C:
means = []
for g in G:
    #The Kernel used is Radia Bias Kernel. I have a loop to find the optimal value of the tuning parameter G(or gamma in
    # the SVC call. I have decided to not tune the penalty parameter C because it makes the runtime too long on my computer).
    clf = svm.SVC(C=1, kernel='rbf', gamma=g)

    acc = []


    # 5-Fold Cross Validation
    for i in range(5):
        if i == 0:
            testing = sdata[:,0:39]
            training = sdata[:,39:194]
            yf = labels[39:194]
            yt = labels[0:39]
        if i == 1:
            testing = sdata[:,39:78]
            training = np.concatenate((sdata[:,0:39],sdata[:,78:194]), axis=1)
            yf = np.concatenate((labels[0:39],labels[78:194]), axis=0)
            yt = labels[39:78]
        if i == 2:
            testing = sdata[:,78:117]
            training = np.concatenate((sdata[:,0:78],sdata[:,117:194]), axis=1)
            yf = np.concatenate((labels[0:78],labels[117:194]), axis=0)
            yt = labels[78:117]
        if i == 3:
            testing = sdata[:,117:156]
            training = np.concatenate((sdata[:,0:117],sdata[:,156:194]), axis=1)
            yf = np.concatenate((labels[0:117],labels[156:194]), axis=0)
            yt = labels[117:156]
        if i == 4:
            testing = sdata[:,156:194]
            training = sdata[:,0:156]
            yf = labels[0:156]
            yt = labels[156:194]

        clf.fit(training.T, yf)
        acc.append(clf.score(testing.T,yt))
    means.append(np.mean(acc))

opt = np.argmax(means)

In [118]:
print("The highest accuracy is:",means[opt])
print("This accuracy is achieved by having a gamma term of:", (opt/100)+.01)

The highest accuracy is: 0.3137651821862348
This accuracy is achieved by having a gamma term of: 0.02
