# **Artificial Intelligence - MSc**

## CS6512 - AI & Data Science Ecosystems - Theory and Practice 
## SEM2 2023/4

### CS6512 E-tivity 2 - Implementing an Outlier Detector for Cryptocurrency Rates with AWS

### Instructor: Emil Vassev
May 2nd, 2024
<br><br>
Copyright (C) 2023 - All rights reserved, do not copy or distribute without permission of the author.
***

## Task
In this e-tivity, you will be granted access to an AWS Academy Learner Lab and asked to:
<ol>
<li>Follow the provided instructions to set up your AWS environment.</li>
<li>Implement an Outlier Detector for cryptocurrency rates by using the outlier-detection algorithms covered in class:
<ul>
<li>Enhanced Dixon Q</li>
<li>Mean & Standard Deviation</li>
<li>Isolation Forest</li>
<li>Boxplots Method</li>
<li>DBSCAN Clustering Method</li>
</ul></li>    
</ol>

Your outlier detector shall be implemented in Python and run on AWS SageMaker. 

### Implementation Subtasks
This e-tivity has two distinct subtasks:
<ol>
    <li>Detecting outliers among cryptocurrency history rates</li>
    <li>Detecting outliers among cryptocurrency live-exchange rates</li>
</ol>    

### Subtask #1: Detecting outliers among cryptocurrency history rates
<ol>
<li>Use AWS SageMaker to open a Jupyter Notebook and implement your assignment there.</li> 
<li>Set up an S3 bucket, upload the **instrument_price.csv** file, and use the provided interface to read from and write to this and other files there.</li> 
<li>Implement an outlier detector that runs on AWS SageMaker and:
  <ul>
    <li>loads CSV data from the AWS S3 storage space and produces an outlier report</li>
    <li>gets history rates from the marketplace CryptoCompare.com and produces an outlier report</li>
  </ul>
</li>
</ol>

### Subtask #2: Detecting outliers among cryptocurrency live-exchange rates
<ol>
<li>Implement a new feature of your Outlier Detector, so it will:
  <ol>
    <li>Get live-exchange cryptocurrency rates from the marketplace CryptoCompare.com every 30 seconds.</li>
    <li>Store these cryptocurrency rates.</li>
    <li>Detect outliers among these cryptocurrency rates.</li>
  </ol>
</li>
</ol>
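
The three steps above can be sketched as a simple polling loop. This is a minimal sketch under stated assumptions, not the required solution: `poll_rates`, `fetch_rate`, and the injectable `sleep` parameter are hypothetical names introduced here for illustration. In the assignment, the rate would come from the provided `CryptoCompareReader` class.

```python
import time
from typing import Callable, List


def poll_rates(fetch_rate: Callable[[], float],
               n_samples: int,
               interval_sec: float = 30.0,
               sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Collects n_samples rates, waiting interval_sec between consecutive reads."""
    rates: List[float] = []
    for i in range(n_samples):
        rates.append(fetch_rate())      # step 1: get the current rate
        if i < n_samples - 1:
            sleep(interval_sec)         # wait before the next read
    return rates                        # step 2: the stored rates


# Demo with a fake feed and a recording stub instead of a real 30-second wait.
fake_feed = iter([101.2, 101.4, 250.0])
waits: List[float] = []
collected = poll_rates(lambda: next(fake_feed), n_samples=3, sleep=waits.append)
print(collected, waits)
```

The collected list can then be handed to any of the outlier-detection methods (step 3). Passing `sleep` as a parameter is only a convenience that keeps the loop testable without waiting.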


## Note 
<span style="color:blue">You will be provided with:</span> 
  <ul>
    <li><span style="color:blue">implementation of the outlier-detection algorithms:</span>
      <ul>  
       <li>class DixonQEnhanced</li>
       <li>class StandardDeviationMethod</li>
       <li>class IsolationForestMethod</li>
       <li>class BoxPlotsMethod</li>
       <li>class DBScanClusteringMethod</li>
      </ul>    
    </li>
    <li><span style="color:blue">implementation of reading live-exchange rates from CryptoCompare.com:</span>
      <ul>  
       <li>class CryptoCompareReader</li>
      </ul>  
    </li>
    <li><span style="color:blue">a library making the communication with the s3 bucket transparent:</span>
      <ul>  
       <li>class S3Utils</li>
      </ul>  
    </li>
    <li><span style="color:blue">implementation of the structure of your code (class and methods) - you will need to follow this structure:</span>
      <ul>  
       <li>class CS6512Assignment2</li>
      </ul> 
    </li> 
  </ul>    

## Implementation 

### Libraries to Use

#### Method #1: Dixon Q Test

See the "Outlier Detection Techniques with Python" practical lesson. 
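
As a quick reminder of the idea behind the test, the classical Dixon Q statistic divides the gap between a suspect extreme value and its nearest neighbour by the range of the series. The snippet below is a minimal sketch of that classical form only; `dixon_q_extremes` is a hypothetical helper, and the provided `DixonQEnhanced` class differs in that it checks every value in the series against several alphas.

```python
from typing import List, Tuple


def dixon_q_extremes(series: List[float]) -> Tuple[float, float]:
    """Returns (Q for the smallest value, Q for the largest value).

    Assumes the series has at least two values.
    """
    ordered = sorted(series)
    value_range = ordered[-1] - ordered[0]
    if value_range == 0:
        return 0.0, 0.0                 # all values equal: nothing to test
    q_low = (ordered[1] - ordered[0]) / value_range
    q_high = (ordered[-1] - ordered[-2]) / value_range
    return q_low, q_high


sample = [10, 11, 12, 13, 14, 15, 16, 17, 18, 40]
q_low, q_high = dixon_q_extremes(sample)

# Critical value for n = 10 at alpha = 0.05, as encoded in the
# criticalValues table of the DixonQEnhanced class in this notebook.
Q_CRITICAL = 0.4122
print(q_low > Q_CRITICAL, q_high > Q_CRITICAL)  # prints: False True
```

Here only the largest value (40) exceeds the critical value, so it is the only suspected outlier.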

In [1]:
"""
Enhanced Dixon Q Test

@author: Emil Vassev
"""

from scipy.stats import shapiro

"""
 *
 * This class implements an enhanced version of DixonQ Test.
 * Provides a set of encoded critical values - up to 100.
 * The encoded critical values are used as a basis to generate critical values for other alphas (levels of confidence).
 * Both encoded and generated critical values are used to produce a result of maximum accuracy when identifying outliers. 
 *  
""" 
class DixonQEnhanced:
    
    criticalValues = {}


    """
     * DixonQEnhanced constructor
    """ 
    def __init__(self):
        self.buildCriticalValues()


    """
     * Builds a dictionary of critical values grouped by alpha  
    """
    def buildCriticalValues(self):
        
        """
         * the critical values are grouped by an alpha key
         * alpha is the probability of incorrectly rejecting the suspected outlier
        """    
        #encoded critical values for alpha = 0.3 (70% level of confidence)
        self.criticalValues[0.30] = [0,0,
                                     0.6836,0.4704,0.3730,0.3173,0.2811,0.2550,0.2361,0.2208,
                                     0.2086,0.1983,0.1898,0.1826,0.1764,0.1707,0.1656,0.1613,
                                     0.1572,0.1535,0.1504,0.1474,0.1446,0.1420,0.1397,0.1376,
                                     0.1355,0.1335,0.1318,0.1300,0.1283,0.1268,0.1255,0.1240,
                                     0.1227,0.1215,0.1202,0.1192,0.1181,0.1169,0.1160,0.1153,
                                     0.1141,0.1134,0.1124,0.1116,0.1108,0.1102,0.1093,0.1087,
                                     0.1079,0.1071,0.1067,0.1060,0.1052,0.1047,0.1041,0.1036,
                                     0.1030,0.1024,0.1019,0.1014,0.1009,0.1004,0.1000,0.0997,
                                     0.0991,0.0987,0.0982,0.0979,0.0974,0.0970,0.0967,0.0961,
                                     0.0960,0.0955,0.0952,0.0948,0.0943,0.0939,0.0937,0.0935,
                                     0.0930,0.0928,0.0925,0.0921,0.0918,0.0915,0.0913,0.0910,
                                     0.0906,0.0903,0.0902,0.0899,0.0896,0.0894,0.0892,0.0890,
                                     0.0887,0.0885]
        
        #encoded critical values for alpha = 0.2 (80% level of confidence)
        self.criticalValues[0.20] = [0,0,
                                     0.7808,0.5603,0.4508,0.3868,0.3444,0.3138,0.2915,0.2735,
                                     0.2586,0.2467,0.2366,0.2280,0.2202,0.2137,0.2077,0.2023,
                                     0.1973,0.1929,0.1890,0.1854,0.1820,0.1790,0.1761,0.1735,
                                     0.1710,0.1687,0.1664,0.1645,0.1624,0.1604,0.1590,0.1571,
                                     0.1555,0.1540,0.1525,0.1512,0.1499,0.1484,0.1472,0.1462,
                                     0.1449,0.1441,0.1430,0.1418,0.1408,0.1400,0.1390,0.1381,
                                     0.1374,0.1365,0.1357,0.1349,0.1340,0.1334,0.1326,0.1320,
                                     0.1312,0.1304,0.1299,0.1294,0.1286,0.1281,0.1275,0.1272,
                                     0.1264,0.1260,0.1254,0.1249,0.1243,0.1238,0.1234,0.1228,
                                     0.1225,0.1221,0.1217,0.1212,0.1205,0.1201,0.1198,0.1195,
                                     0.1189,0.1187,0.1182,0.1178,0.1174,0.1171,0.1167,0.1165,
                                     0.1160,0.1156,0.1154,0.1151,0.1147,0.1144,0.1141,0.1138,
                                     0.1134,0.1131]        

        #encoded critical values for alpha = 0.1 (90% level of confidence)
        self.criticalValues[0.10] = [0,0,
                                     0.8850,0.6789,0.5578,0.4840,0.4340,0.3979,0.3704,0.3492,
                                     0.3312,0.3170,0.3045,0.2938,0.2848,0.2765,0.2691,0.2626,
                                     0.2564,0.2511,0.2460,0.2415,0.2377,0.2337,0.2303,0.2269,
                                     0.2237,0.2208,0.2182,0.2155,0.2132,0.2110,0.2088,0.2066,
                                     0.2045,0.2026,0.2008,0.1993,0.1974,0.1958,0.1944,0.1930,
                                     0.1915,0.1902,0.1890,0.1875,0.1865,0.1850,0.1839,0.1829,
                                     0.1819,0.1808,0.1797,0.1788,0.1777,0.1768,0.1759,0.1752,
                                     0.1741,0.1733,0.1726,0.1717,0.1707,0.1703,0.1694,0.1689,
                                     0.1679,0.1674,0.1667,0.1660,0.1652,0.1648,0.1641,0.1635,
                                     0.1631,0.1626,0.1620,0.1613,0.1605,0.1601,0.1596,0.1594,
                                     0.1586,0.1583,0.1576,0.1573,0.1567,0.1563,0.1557,0.1554,
                                     0.1547,0.1544,0.1540,0.1537,0.1532,0.1528,0.1524,0.1521,
                                     0.1516,0.1512]        

        #encoded critical values for alpha = 0.05 (95% level of confidence)
        self.criticalValues[0.05] = [0,0,
                                     0.9411,0.7651,0.6423,0.5624,0.5077,0.4673,0.4363,0.4122,
                                     0.3922,0.3755,0.3615,0.3496,0.3389,0.3293,0.3208,0.3135,
                                     0.3068,0.3005,0.2947,0.2895,0.2851,0.2804,0.2763,0.2725,
                                     0.2686,0.2655,0.2622,0.2594,0.2567,0.2541,0.2513,0.2488,
                                     0.2467,0.2445,0.2423,0.2408,0.2383,0.2366,0.2350,0.2334,
                                     0.2319,0.2302,0.2288,0.2273,0.2257,0.2241,0.2228,0.2216,
                                     0.2206,0.2191,0.2182,0.2169,0.2160,0.2145,0.2135,0.2126,
                                     0.2116,0.2106,0.2095,0.2085,0.2075,0.2070,0.2057,0.2053,
                                     0.2045,0.2037,0.2030,0.2020,0.2013,0.2005,0.1996,0.1990,
                                     0.1984,0.1980,0.1973,0.1964,0.1955,0.1950,0.1943,0.1940,
                                     0.1934,0.1927,0.1922,0.1918,0.1909,0.1906,0.1899,0.1896,
                                     0.1887,0.1885,0.1881,0.1876,0.1869,0.1865,0.1860,0.1856,
                                     0.1851,0.1846]        

        #encoded critical values for alpha = 0.02 (98% level of confidence)
        self.criticalValues[0.02] = [0,0,
                                     0.9763,0.8457,0.7291,0.6458,0.5864,0.5432,0.5091,0.4813,
                                     0.4591,0.4405,0.4250,0.4118,0.3991,0.3883,0.3792,0.3711,
                                     0.3630,0.3562,0.3495,0.3439,0.3384,0.3328,0.3287,0.3242,
                                     0.3202,0.3163,0.3127,0.3093,0.3060,0.3036,0.2999,0.2973,
                                     0.2948,0.2921,0.2898,0.2879,0.2853,0.2836,0.2815,0.2794,
                                     0.2778,0.2758,0.2744,0.2726,0.2711,0.2690,0.2676,0.2662,
                                     0.2651,0.2632,0.2620,0.2606,0.2595,0.2582,0.2570,0.2555,
                                     0.2545,0.2531,0.2522,0.2510,0.2500,0.2493,0.2480,0.2472,
                                     0.2466,0.2457,0.2445,0.2436,0.2429,0.2420,0.2409,0.2402,
                                     0.2398,0.2387,0.2382,0.2372,0.2365,0.2360,0.2349,0.2345,
                                     0.2337,0.2330,0.2322,0.2319,0.2309,0.2304,0.2298,0.2294,
                                     0.2285,0.2279,0.2272,0.2272,0.2259,0.2257,0.2251,0.2247,
                                     0.2240,0.2234]        

        #encoded critical values for alpha = 0.01 (99% level of confidence)
        self.criticalValues[0.01] = [0,0,
                                     0.9881,0.8886,0.7819,0.6987,0.6371,0.5914,0.5554,0.5260,
                                     0.5028,0.4831,0.4664,0.4517,0.4385,0.4268,0.4166,0.4081,
                                     0.4002,0.3922,0.3854,0.3789,0.3740,0.3674,0.3625,0.3583,
                                     0.3543,0.3499,0.3460,0.3425,0.3390,0.3357,0.3323,0.3294,
                                     0.3266,0.3238,0.3213,0.3187,0.3163,0.3141,0.3124,0.3102,
                                     0.3081,0.3061,0.3050,0.3028,0.3009,0.2991,0.2972,0.2960,
                                     0.2941,0.2927,0.2920,0.2899,0.2880,0.2873,0.2859,0.2845,
                                     0.2828,0.2816,0.2812,0.2792,0.2784,0.2775,0.2766,0.2754,
                                     0.2742,0.2735,0.2724,0.2714,0.2709,0.2696,0.2682,0.2677,
                                     0.2667,0.2662,0.2656,0.2646,0.2637,0.2633,0.2621,0.2614,
                                     0.2608,0.2599,0.2588,0.2584,0.2573,0.2568,0.2566,0.2558,
                                     0.2548,0.2543,0.2539,0.2535,0.2524,0.2521,0.2512,0.2513,
                                     0.2499,0.2498]   
        
        #encoded critical values for alpha = 0.005 (99.5% level of confidence)
        self.criticalValues[0.005] = [0,0,
                                     0.9940,0.9201,0.8234,0.7437,0.6809,0.6336,0.5952,0.5668,
                                     0.5416,0.5208,0.5034,0.4869,0.4739,0.4614,0.4504,0.4423,
                                     0.4333,0.4247,0.4173,0.4109,0.4051,0.3986,0.3935,0.3889,
                                     0.3843,0.3801,0.3762,0.3718,0.3685,0.3646,0.3610,0.3583,
                                     0.3548,0.3522,0.3498,0.3465,0.3443,0.3415,0.3400,0.3377,
                                     0.3353,0.3332,0.3325,0.3298,0.3279,0.3256,0.3235,0.3225,
                                     0.3204,0.3191,0.3177,0.3163,0.3140,0.3136,0.3118,0.3098,
                                     0.3089,0.3075,0.3071,0.3061,0.3041,0.3031,0.3025,0.3006,
                                     0.2996,0.2990,0.2983,0.2968,0.2959,0.2946,0.2934,0.2932,
                                     0.2922,0.2912,0.2905,0.2897,0.2885,0.2876,0.2870,0.2859,
                                     0.2852,0.2844,0.2836,0.2832,0.2818,0.2811,0.2808,0.2798,
                                     0.2790,0.2788,0.2784,0.2775,0.2766,0.2764,0.2755,0.2751,
                                     0.2738,0.2737]   

        """
         * Generates all critical values by using the encoded values as a basis.
         * Values are generated between any two existing pairs of alphas.
        """ 
        #generate range alpha 0.2 - 0.1
        self.generateCriticalValuesForAlphaPair(0.2,0.1)

        #generate range alpha 0.3 - 0.2
        self.generateCriticalValuesForAlphaPair(0.3,0.2)

        #generate range alpha 0.10 - 0.05
        self.generateCriticalValuesForAlphaPair(0.10,0.05)

        #generate range alpha 0.05 - 0.02
        self.generateCriticalValuesForAlphaPair(0.05,0.02)
        
        
    """
     * Generates the missing series of critical values between two alphas with a step = 0.01
     * constraint: alpha1 > alpha2
    """ 
    def generateCriticalValuesForAlphaPair(self, alpha1, alpha2):
        
        # equal alphas would lead to a division by zero below, so reject them too
        if alpha1 <= alpha2:
            raise Exception('The value of alpha1 must be greater than alpha2.')
            
        nInsideAlphas = int(round((alpha1 - alpha2)/(0.01)) - 1)
        
        insideAlphas = []
        
        step = 0.01
        for i in range(1,nInsideAlphas+1):
            newAlpha = round(alpha2 + i*step,2)
            insideAlphas.append(newAlpha) 
        
        for index in range(2,100):

            rangeLeft = self.criticalValues[alpha1][index]
            rangeRight = self.criticalValues[alpha2][index]
        
            distance = round(((rangeRight - rangeLeft)/(nInsideAlphas+1)),4)
            
            currentValue = self.criticalValues[alpha1][index]
            
            for insideAlpha in insideAlphas:
                
                if insideAlpha not in self.criticalValues.keys():
                    self.criticalValues[insideAlpha] = []
                    self.criticalValues[insideAlpha].append(0)
                    self.criticalValues[insideAlpha].append(0)
                
                currentValue += distance
                
                currentValue = round(currentValue,4)
                
                self.criticalValues[insideAlpha].append(currentValue)      
                
    """
     * Finds the index of the next element in a series
     * (for the last element, the index of the previous element is returned)
    """
    def findNextInSeries(self, number, series):
        
        result = -1
        
        try:
            index = series.index(number)
        except ValueError as e:
            raise Exception('The number has not been found in the series.')

        if index == (len(series) - 1):
            result = index - 1
        else:
            result = index + 1

        return result

        
    """
     * Finds the index of the previous element in a series
     * (for the first element, the index of the next element is returned)
    """
    def findPreviousInSeries(self, number, series):
        
        result = -1
        
        try:
            index = series.index(number)
        except ValueError as e:
            raise Exception('The number has not been found in the series.')

        if index == 0:
            result = index + 1
        else:
            result = index - 1

        return result

        
    """
     * Identifies if a number is outlier within a series and for particular alpha
    """
    def isOutlier(self, number, series, alpha):
                
        qCritical = 0.0
        
        qExpDivisor = series[len(series)-1] - series[0]
        
        if qExpDivisor == 0:
            return False
        
        # critical values are encoded only for series of 3 to 100 elements
        if len(series) < 3 or len(series) > 100:
            return False

        nextNumberGap = abs(number - series[self.findNextInSeries(number,series)])
        prevNumberGap = abs(number - series[self.findPreviousInSeries(number,series)])
        if prevNumberGap < nextNumberGap:
            closestNumberGap = prevNumberGap
        else:
            closestNumberGap = nextNumberGap
            
        qExp = closestNumberGap/qExpDivisor
        
        if alpha in self.criticalValues.keys():
            qCritical = self.criticalValues[alpha][len(series)-1]
            
        if qExp > qCritical:
            return True
        else:
            return False
        

    """
     * Identifies all the outliers within a series
     * Uses the isOutlier method
    """
    def findOutliers(self, series):
        
        outliers = {}
        
        for alpha in self.criticalValues.keys():            
            for number in series:
                if self.isOutlier(number,series,alpha):
                    if number in outliers:
                        if outliers[number] < (1-alpha):
                            outliers[number] = (1-alpha)
                    else:
                        outliers[number] = (1-alpha)
                        
        return outliers
    
    
    """
     * Checks if the data set is normally distributed;
     * running DixonQ Test on different distributions will lead to erroneous results
     *
     * Runs a Shapiro-Wilk test to check if the series is Gaussian
    """    
    def checkForNormalDisribution(self, series):
        
        print("Shapiro-Wilk: Running Shapiro-Wilk test ....")
        
        stat, p = shapiro(series)
        
        alpha = 0.05
        
        if p > alpha:       
            print("Shapiro-Wilk: Series looks Gaussian")
            print("")
            return True

        else:
            print("Shapiro-Wilk: Series does not look Gaussian")
            print("")
            return False
        
        
    """
     * Executes DixonQ Test on the provided series of numbers;
     * DixonQ Test is executed for all available alpha keys (levels of confidence)
    """ 
    def execute(self, series):
        
        outliers = {}

        series.sort(reverse=False)
        
        if not self.checkForNormalDisribution(series):
            print("DixonQ Test: Warning: Test should not be run on a series that is not normally distributed.")

        outliers = self.findOutliers(series)
        
        return outliers        

#### Method #2: Mean & Standard Deviation

See the "Outlier Detection Techniques with Python" practical lesson. 
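
The rule itself is short: a value is flagged when it lies more than `k` standard deviations away from the mean. The following is a minimal sketch of that rule, assuming the same cut-off of 2.5 standard deviations used by the class below; `std_outliers` and the sample data are hypothetical and only illustrate the calculation.

```python
import numpy as np


def std_outliers(series, k=2.5):
    """Returns the values lying outside mean +/- k * standard deviation."""
    values = np.asarray(series, dtype=float)
    mean = values.mean()
    std = values.std()                  # population standard deviation, as np.std
    lower, upper = mean - k * std, mean + k * std
    return [float(v) for v in values if v < lower or v > upper]


# 20 rates close to 100 and a single spike
rates = [100.0, 101.0, 99.0, 102.0, 98.0] * 4 + [200.0]
print(std_outliers(rates))  # prints: [200.0]
```

Note that a large spike also inflates the mean and the standard deviation, so in short series a single extreme value can mask itself; this is one reason the cut-off is a tuning choice.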

In [2]:
import numpy as np
#import matplotlib.pyplot as plt


"""
 * This class implements the Standard Deviation Method for detecting outliers
"""
class StandardDeviationMethod:
    
    
    methodName = "StandardDeviationMethod"
    
    upperLimit = 0.0
    lowerLimit = 0.0
    seriesStd = 0.0
    seriesMean = 0.0
    

    """
     *
    """ 
    def __init__(self):
        
        pass
    
    
    """
     *
    """ 
    def getMethodName(self):
        return "Standard Deviation Method"

        
    """
     * Function to detect outliers on one-dimensional datasets
    """
    def execute(self, series):
        
        outliers = []
    
        # set upper and lower limits to 2.5 times the standard deviation around the mean
        seriesStd = np.std(series)
        seriesMean = np.mean(series)
        #anomalyCutOff = seriesStd * 3
        #anomalyCutOff = seriesStd * 2
        #anomalyCutOff = seriesStd * 1.5
        #anomalyCutOff = seriesStd * 1.75
        anomalyCutOff = seriesStd * 2.5
        
        lowerLimit  = seriesMean - anomalyCutOff 
        upperLimit = seriesMean + anomalyCutOff
        
        #print(lowerLimit)
        
        self.upperLimit = upperLimit
        self.lowerLimit = lowerLimit
        self.seriesStd = seriesStd
        self.seriesMean = seriesMean

        # generate outliers
        for outlier in series:
            if outlier > upperLimit or outlier < lowerLimit:
                outliers.append(outlier)
                
        return outliers

#### Method #3: Isolation Forest

Isolation Forest is an **ensemble method** (similar to random forest), i.e., it uses the average of the predictions made by several decision trees when assigning the final anomaly score to a given data point. 
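
A minimal sketch of how this looks with scikit-learn on a one-dimensional series is shown below. The helper name `isolation_forest_outliers`, the fixed `random_state`, and the sample data are assumptions made for illustration; the provided class further down does not fix the random seed, so its results can vary between runs.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def isolation_forest_outliers(series, random_state=0):
    """Returns the values that the forest labels as anomalies (-1)."""
    # scikit-learn expects a 2-D array: one row per sample, one column per feature
    X = np.asarray(series, dtype=float).reshape(-1, 1)
    labels = IsolationForest(random_state=random_state).fit_predict(X)
    return [float(v) for v, label in zip(X.ravel(), labels) if label == -1]


# 50 rates between 100 and 106 and one extreme value
rates = [100.0 + (i % 7) for i in range(50)] + [1000.0]
flagged = isolation_forest_outliers(rates)
print(1000.0 in flagged)
```

With the default settings the forest may also flag some values at the edges of the normal range, so its output is usually best combined with the other methods.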

In [3]:
from sklearn.ensemble import IsolationForest
import pandas as pd

"""
 * This class implements the Isolation Forest Method for detecting outliers
"""
class IsolationForestMethod:

    """
     *
    """ 
    def __init__(self):       
        pass

    
    """
     *
    """ 
    def checkAllElementsEqual(self, series):
        return len(set(series)) <= 1
    

    """
     * Function to detect outliers on one-dimensional datasets
    """
    def execute(self, series):
        outliers = []
        
        if not self.checkAllElementsEqual(series):          
            df = pd.DataFrame({'temp':series})
            clf = IsolationForest().fit(df['temp'].values.reshape(-1, 1)) 
            outliersInds = clf.predict(df['temp'].values.reshape(-1, 1))
            
            for indx in range(0, len(outliersInds)):
                if outliersInds[indx] == -1:
                    outliers.append(series[indx])    
                    
        return outliers

#### Method #4: Boxplots

A **Box Plot** (or boxplot) is a method for graphically depicting groups of numerical data through their quartiles.

A boxplot is a standardized way of displaying the distribution of data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).

- median (Q2/50th Percentile): the middle value of the dataset.
- first quartile (Q1/25th Percentile): the middle number between the smallest number (not the “minimum”) and the median of the dataset.
- third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.
- Interquartile Range (IQR): the distance between the 25th and the 75th percentiles (Q3 - Q1). The IQR tells how spread out the middle values are.
- “maximum”: Q3 + 1.5*IQR
- “minimum”: Q1 - 1.5*IQR
- Outliers (shown as green circles): in statistics, an outlier is an observation point that is distant from other observations.
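
The definitions above translate directly into a few lines of NumPy. This is a minimal sketch, assuming the 1.5*IQR limits listed above; `iqr_outliers` and the sample data are hypothetical and are not part of the provided classes.

```python
import numpy as np


def iqr_outliers(series, whis=1.5):
    """Returns the values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(series, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whis * iqr             # the "minimum" limit
    upper = q3 + whis * iqr             # the "maximum" limit
    return [float(v) for v in values if v < lower or v > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_outliers(data))  # prints: [100.0]
```

For this sample, Q1 = 3.25 and Q3 = 7.75, so the IQR is 4.5 and the limits are -3.5 and 14.5; only 100 falls outside them.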

<div>
 <img src="attachment:image.png" width="350"/>
</div>


In [4]:
import seaborn as sns
from matplotlib.cbook import boxplot_stats

"""
 * This class implements the Boxplots Method for detecting outliers
"""
class BoxPlotsMethod:
        
    methodName = "BoxPlotsMethod"
    
    
    """
     *
    """ 
    def __init__(self):
        pass


    """
     *
    """ 
    def getMethodName(self):
        return "Boxplots Method"

        
    """
     * Function to detect outliers on one-dimensional datasets
    """
    def execute(self, series):
        outliers = []

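        # draws the boxplot with whiskers at 2.5*IQR
        ax = sns.boxplot(data=series, whis=2.5)
        
        # note: boxplot_stats is called with its default whis=1.5, so the fliers
        # are computed with 1.5*IQR limits, not with the 2.5*IQR whiskers drawn above
        outliers = [y for stat in boxplot_stats(series) for y in stat['fliers']]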

        return outliers

#### Method #5:  DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
- a clustering algorithm commonly used for outlier detection in machine learning
- works based on the density of points in a given dataset

The algorithm starts by selecting a random point from the dataset and then it finds all the points that are within a specified radius (**eps**) from this point. If the number of points within this radius is greater than or equal to a specified minimum number of points (**min_samples**), then a cluster is formed. The algorithm continues to find all the neighboring points of every point in the cluster, forming larger clusters until there are no more points left to add.

Points that are not included in any cluster are considered outliers. The algorithm labels these points as noise, since they do not belong to any cluster.
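
The description above can be sketched with scikit-learn, where noise points receive the label `-1`. This is a minimal sketch, assuming the same `eps = 0.5` and `min_samples = 2` used by the class below; `dbscan_outliers` and the sample data are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def dbscan_outliers(series, eps=0.5, min_samples=2):
    """Returns the values that DBSCAN labels as noise (-1)."""
    # scikit-learn expects a 2-D array: one row per sample, one column per feature
    X = np.asarray(series, dtype=float).reshape(-1, 1)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return [float(v) for v, label in zip(X.ravel(), labels) if label == -1]


data = [10.0, 10.1, 10.2, 10.3, 25.0]
print(dbscan_outliers(data))  # prints: [25.0]
```

Because **eps** is expressed in the same units as the data, a radius of 0.5 is tight for rates in the hundreds, and many ordinary values may then be labelled as noise.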

In [5]:
from sklearn.cluster import DBSCAN
import pandas as pd

"""
 * This class implements the DBScan Clustering Method for detecting outliers
"""
class DBScanClusteringMethod:

    
    methodName = "DBScanClusteringMethod"


    """
     *
    """ 
    def __init__(self):
        
        pass


    """
     *
    """ 
    def checkAllElementsEqual(self, series):
        return len(set(series)) <= 1


    """
     *
    """ 
    def getMethodName(self):
        return "DBScan Clustering Method"

        
    """
     * Function to detect outliers on one-dimensional datasets
    """
    def execute(self, series):
        outliers = []

        if not self.checkAllElementsEqual(series):
            
            df = pd.DataFrame({'temp':series})
            outliersDetection = DBSCAN(min_samples = 2, eps = 0.5)
            outliersInds = outliersDetection.fit_predict(df['temp'].values.reshape(-1, 1))
            
            for indx in range(0, len(outliersInds)):

                if outliersInds[indx] == -1:
                    outliers.append(series[indx])    

        return outliers

#### CryptoCompare Reader

In [10]:
"""
@author: Emil Vassev
"""



import requests


"""
CryptoCompare.com Reader class - uses the CryptoCompare.com API to retrieve history and current rates
"""
class CryptoCompareReader:
    
    apiKey = "fe6382d7770ad0c939c5c12d51e76ab772afbc361f2900405fe8bc930e31ed97"
    urlCurrent = "https://min-api.cryptocompare.com/data/pricemulti?fsyms=$1&tsyms=USD&api_key=" + apiKey
    urlHistory = "https://min-api.cryptocompare.com/data/v2/histoday?fsym=$1&tsym=USD&limit=$2"

    
    def __init__(self):
        pass
 
    
    def extractCoinRates(self, apiResult):
        usdToCoinRates = []
        
        data = apiResult.get("Data").get("Data")
        
        for cryptoCurrency in data:
            coinResult = cryptoCurrency["close"]
            usdToCoinRates.append(coinResult)

        return usdToCoinRates

    
    def readHistoryRates (self, cryptoCurrency, size):
        
        urlRestAPI = self.urlHistory.replace("$1", cryptoCurrency)

        # str() lets callers pass the size either as an int or as a string
        urlRestAPI = urlRestAPI.replace("$2", str(size))
        
        response = requests.get(urlRestAPI)
        
        return self.extractCoinRates(response.json())
    
    
    def readCurrentRate (self, cryptoCurrency):
        
        urlRestAPI = self.urlCurrent.replace("$1", cryptoCurrency)

        response = requests.get(urlRestAPI).json()
        
        coinRate = response[cryptoCurrency].get('USD')
        
        return coinRate 

#### AWS S3 Utils

In [11]:
"""
@author: Emil Vassev
"""

import boto3
import pandas as pd
import os
from io import StringIO


bucket_name = 'cs6512'
AWS_S3_BUCKET = bucket_name
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")


def read_csv_file_from_s3(csv_file_name):
    
    client = boto3.client('s3')

    object_key = csv_file_name

    csv_obj = client.get_object(Bucket=AWS_S3_BUCKET, Key=object_key)
    body = csv_obj['Body']
    csv_string = body.read().decode('utf-8')

    df = pd.read_csv(StringIO(csv_string))

    return df


def write_csv_file_to_s3(csv_file_name, df):

    client = boto3.client(
                              "s3", 
                              aws_access_key_id=AWS_ACCESS_KEY_ID, 
                              aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                              aws_session_token=AWS_SESSION_TOKEN
                            )
    
    with StringIO() as csv_buffer:
        df.to_csv(csv_buffer, index=False)

        response = client.put_object(Bucket=AWS_S3_BUCKET, Key=csv_file_name, Body=csv_buffer.getvalue())

    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")

    if status == 200:
        print(f"Successful S3 put_object response. Status - {status}")
    else:
        print(f"Unsuccessful S3 put_object response. Status - {status}")


### Your Implementation

Implement your solution by following the structure of the <i>CS6512Assignment2</i> class.  
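
The skeleton asks for a joint result based on majority voting, where at least 3 of the 5 methods must confirm an outlier. One possible way to count the votes is sketched below; this is only an illustration under stated assumptions, and `majority_vote` and the sample lists are hypothetical names and data, not part of the provided code.

```python
from collections import Counter
from typing import Iterable, List


def majority_vote(outlier_lists: Iterable[List[float]], min_votes: int = 3) -> List[float]:
    """Returns the values reported as outliers by at least min_votes methods."""
    votes: Counter = Counter()
    for outliers in outlier_lists:
        # a set makes sure each method votes at most once per value
        votes.update(set(outliers))
    return sorted(value for value, count in votes.items() if count >= min_votes)


dixon_q = [235.18]
st_dev = [235.18]
iso_forest = [83.825, 200.4, 235.18]
boxplots = [40.0, 235.18]
dbscan = [40.0, 200.4, 235.18]

joint = majority_vote([dixon_q, st_dev, iso_forest, boxplots, dbscan])
print(joint)  # prints: [235.18]
```

Exact equality is used when matching values across methods, which works here because all methods return values taken from the same input series.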

In [None]:
"""
@author: put your name and id
"""

#add your imports here


class CS6512Assignment2:

    sereisCSV = []
    seriesCrptCmpr = []    

    
    #describes a series and saves the result to a file
    def describeSereis(self, series, fileName):
        
        #add your code here        
        
        return True
    
    
    #extracts a cryptocurrency series from a provided dataframe
    def extractSeriesFromDF(self, df, cryptoCurrency):
        
        series = []
        
        #add your code here 
        
        return series


    #gets a cryptocurrency series from CryptoCompare.com 
    def extractSereisFromCryptoCompare(self, cryptoCurrency):

        series = []
        
        #add your code here 
        #use the CryptoCompareReader class to get a history of 600 rates
        
        return series
        

    #extracts 100 elements from a series at random 
    def extract100ElementsAtRandom(self, series):

        series100 = []
        
        #add your code here 
     
        return series100


    #executes the Dixon Q method  
    def executeDixonQ(self, series):

        outliers = []
        
        #add your code here 
     
        return outliers


    #executes the Standard Deviation method
    def executeStDevition(self, series):

        outliers = []
        
        #add your code here 
     
        return outliers
    

    #executes the Isolation Forest method
    def executeIsolationForest(self, series):

        outliers = []
        
        #add your code here 
     
        return outliers
    
    
    #executes the Boxplots method
    def executeBoxplots(self, series):

        outliers = []
        
        #add your code here 
     
        return outliers


    #executes the DBSCAN Clustering method
    def executeDBSCANClustering(self, series):

        outliers = []
        
        #add your code here 
     
        return outliers

    
    #records the joint-outliers results into your S3 storage
    #JSON format:
    # {
    #  "Series_100": [83.825, 84.715, 86.94, 88.1, 90.0, 90.365, 91.21, 92.16, 92.74, 94.0, 
    #                 94.31, 95.0, 95.37, 95.49, 96.315, 97.67, 98.805, 102.94, 108.46, 109.73, 
    #                 110.03, 110.42, 111.34, 111.89, 113.0, 113.25, 113.39, 113.88, 114.73, 
    #                 114.87, 115.22, 117.37, 122.57, 133.37, 133.995, 135.58, 136.26, 136.52, 
    #                 136.67, 136.82, 137.22, 138.18, 138.33, 140.57, 142.19, 142.4, 143.65, 
    #                 146.98, 147.27, 147.27, 147.89, 148.225, 148.31, 148.34, 148.435, 148.9, 
    #                 152.285, 154.34, 159.11, 159.89, 167.88, 169.0, 169.69, 169.775, 170.74, 
    #                 170.82, 171.17, 171.22, 171.24, 171.86, 171.88, 172.13, 172.39, 172.4, 173.96, 
    #                 174.67, 175.035, 175.4, 175.66, 175.84, 176.52, 178.37, 179.2, 179.47, 
    #                 179.79, 179.95, 181.94, 182.01, 188.0, 188.26, 188.4, 189.25, 191.04, 
    #                 192.53, 197.05, 198.28, 198.34, 198.69, 200.4, 235.18], 
    #  "Dixon_Q": [235.18], 
    #  "Standard_Deviation": [235.18], 
    #  "Isolation_Forest": [83.825, 84.715, 86.94, 88.1, 90.0, 90.365, 91.21, 92.16, 92.74, 96.315, 97.67, 
    #                       98.805, 102.94, 108.46, 117.37, 122.57, 154.34, 159.11, 159.89, 167.88, 181.94, 
    #                       182.01, 188.0, 189.25, 191.04, 192.53, 197.05, 198.28, 198.34, 198.69, 200.4, 235.18], 
    #  "Boxplots": [40.0, 60.0, 63.0, 63.0, 64.0, 65.0, 169.114286, 169.429444, 172.353, 197.4606, 235.18],
    #  "DBSCAN_Clustering": [40, 60, 64, 65, 72.5126, 75.671004, 79.782605, 91.300919, 92, 94.931864, 96, 
    #                        103.824548, 104.851577, 105.948, 109.593951, 113.338919, 114.672941, 116.48158, 
    #                        118.7118, 122.25, 122.873307, 124.339492, 125.154034, 136.782077, 143.9881, 
    #                        146.902942, 156.476319, 159.2281, 164.957759, 167.574109, 172.353, 197.4606, 235.18],
    #  "Joint_Outliers": [235.18] 
    # }
    def produceJsonOutliers(self, json_file_name, series_100, 
                            outliers_DQ, outliers_StD, outliers_IF, outliers_BXPLTS, 
                            outliers_DBSCAN_CL):
        
        #add your code here 
            
        return True    

        
    #produces a joint-outliers result out of the five series
    #use majority voting - at least 3 methods need to confirm the outliers
    def produceJointResultOutOfAll(self, outliers1, outliers2, outliers3, outliers4, outliers5):
        joint_outliers = []
        
        #add your code here 
     
        return joint_outliers
  
    
    #executes an outliers detection phase
    #possible phases: 
    # phase_1 - uses the provided CSV file to detect outliers 
    # phase_2 - uses the history rates extracted from CryptoCompare.com to detect outliers
    # phase_3 - uses live-exchange rates provided from CryptoCompare.com to detect outliers
    def executePhase(self, phase_num, series):

        #add your code here 
            
        return True    

    
    #entry point for the one-shot execution
    def execute(self, s3_df, crypto_currency):
    
        #phase #1
        series = self.extractSeriesFromDF(s3_df, crypto_currency)
        self.executePhase("phase_1", series) 
        
        #phase #2
        series = self.extractSereisFromCryptoCompare(crypto_currency)
        self.executePhase("phase_2", series) 
            
        return True   

    
    #entry point for the time-based execution
    def executeOnTimer(self, crypto_currency):
        
        #phase #3
        series = []

        #add your code here 

        self.executePhase("phase_3", series) 
            
        return True   


### Pseudo code for executePhase() method

    def executePhase(self, phase_num, series):
        while (True):
            
            #step #1: exit on series size < 3
            
            #step #2: if series size > 100 then create a series of 100 elements selected at random from series
        
            #step #3: joint_outliers_5 = "joint outliers of all 5 algorithms"      
        
            if "joint_outliers_5" is not empty:
                #step #4: self.produceJsonOutliers(arguments go here)

                break
            
            if phase_num == "phase_3":
                break

        return True 
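
The control flow of the pseudocode above can be sketched as a plain function. This is an illustration under assumptions, not the reference solution: `run_phase`, `detect_joint_outliers`, `sample_size`, and `max_attempts` are hypothetical names, and the detector is passed in as a callable so the sketch stays self-contained. The `max_attempts` guard is an addition of this sketch. Without some guard, a series of 100 or fewer elements with no joint outliers would repeat the loop indefinitely in phases 1 and 2.

```python
import random


def run_phase(series, detect_joint_outliers, phase_num,
              sample_size=100, max_attempts=50, rng=None):
    """Control-flow sketch of executePhase().

    detect_joint_outliers: hypothetical callable, sample -> list of joint outliers.
    Returns (sample, joint_outliers), or None when nothing is reported.
    """
    rng = rng or random.Random()
    for _ in range(max_attempts):  # guard added in this sketch (assumption)
        # step #1: exit on series size < 3
        if len(series) < 3:
            return None

        # step #2: draw a random sample when the series is longer than sample_size
        if len(series) > sample_size:
            sample = rng.sample(list(series), sample_size)
        else:
            sample = list(series)

        # step #3: joint outliers of all algorithms
        joint = detect_joint_outliers(sample)

        if joint:
            # step #4: the caller would hand these to produceJsonOutliers()
            return sample, joint

        # phase_3 runs once per timer tick and does not resample
        if phase_num == "phase_3":
            return None
    return None
```

Returning the sample together with the joint outliers keeps the report consistent, since the JSON file records the 100-element series that was actually analysed.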

### Pseudo code for executeOnTimer() method

    def executeOnTimer(self, crypto_currency):
    
        while (True):
        
            check if 30 seconds have elapsed: 
            
                reset the timer 
       
                #step #1: if the csv file with live-exchange cryptocurrency rates exists then load the rates from this file
                       
                #step #2: get the current live-exchange rate for crypto_currency by using the CryptoCompareReader class  
        
                #step #3: add live_rate to series
        
                #step #4: write the updated series into a csv file with live-exchange cryptocurrency rates
                               
                #step #5: self.executePhase("phase_3", series) 
                    
                break
            
        return True
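
One way to read the timer pseudocode above is sketched below, using only the standard library. All names here are assumptions for illustration: `fetch_live_rate` stands in for the call to the CryptoCompareReader class, `on_series` stands in for `self.executePhase("phase_3", series)`, and the CSV layout (one rate per row, local file) is hypothetical. In the notebook the file would live in S3 and be accessed through the provided interface.

```python
import csv
import os
import time


def load_rates(csv_path):
    """Load previously stored live rates; return an empty list if no file exists."""
    if not os.path.exists(csv_path):
        return []
    with open(csv_path, newline="") as f:
        return [float(row[0]) for row in csv.reader(f) if row]


def save_rates(csv_path, series):
    """Write one rate per row, overwriting the previous file."""
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows([rate] for rate in series)


def run_on_timer(fetch_live_rate, csv_path, on_series, interval_sec=30.0):
    """Wait for one interval, then run steps #1 to #5 once."""
    started = time.monotonic()
    while True:
        remaining = interval_sec - (time.monotonic() - started)
        if remaining > 0:
            # sleep in short slices instead of busy-waiting
            time.sleep(min(remaining, 1.0))
            continue

        started = time.monotonic()               # reset the timer
        series = load_rates(csv_path)            # step #1
        series.append(float(fetch_live_rate()))  # steps #2 and #3
        save_rates(csv_path, series)             # step #4
        on_series(series)                        # step #5
        break
    return True
```

`time.monotonic()` is used rather than `time.time()` because it is not affected by system clock adjustments. Calling `run_on_timer` repeatedly accumulates one new rate per interval in the CSV file.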

### The extract100ElementsAtRandom() method
The Dixon Q test works on series of at most 100 elements. If a series is longer than 100 elements, we extract 100 elements at random. All five algorithms then process this 100-element series.
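
A minimal sketch of the sampling step, assuming sampling without replacement via the standard `random` module. The function name, the `size` and `seed` parameters, and the decision to return the sample sorted are assumptions of this sketch. Sorting is chosen here only because the JSON example lists `Series_100` in ascending order.

```python
import random


def extract_100_at_random(series, size=100, seed=None):
    """Return at most `size` elements of `series`, chosen at random, in ascending order."""
    values = list(series)
    if len(values) <= size:
        # short series are used as they are
        return sorted(values)
    rng = random.Random(seed)  # pass a seed only when reproducibility is wanted
    return sorted(rng.sample(values, size))
```

Because the sample is random, two runs over the same long series can report different outliers, which is why the phase logic may draw a new sample when no joint outliers are found.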

## Testing

You are required to test your solution with the <span style="color:blue"><b>Solana</b></span> cryptocurrency (ticker = <span style="color:blue"><b>'SOL'</b></span>). Solana is often described as one of the fastest blockchains and one of the fastest-growing ecosystems in cryptocurrency, so its exchange rates can be quite volatile.

## What to deliver?

You are asked to deliver:
<ul>
<li>your solution: implemented in this notebook</li>
<li>your result files: check the <b>results</b> directory to see the file names and their structure</li>
</ul>

## A glimpse of your initial data

In [1]:
import pandas as pd

df = pd.read_csv('data/instrument_price.csv')

#df = read_csv_file_from_s3('data/instrument_price.csv')

df.head(10)


| Unnamed: 0 | instrument_ticker | currency_code | bid | offer | pricing_source_code | time |
|---|---|---|---|---|---|---|
| 0 | LINK | USD | 18.08329 | 18.08329 | KRAKEN | 2/10/2022 6:31 |
| 1 | BTC | USD | 43821.375 | 43821.375 | COINBASE | 2/10/2022 6:31 |
| 2 | LTC | USD | 138.06 | 138.06 | KRAKEN | 2/10/2022 6:31 |
| 3 | DAI | USD | 1.0189 | 1.0189 | BINANCE | 2/10/2022 6:31 |
| 4 | XRP | USD | 0.8715 | 0.8715 | BINANCE | 2/10/2022 6:01 |
| 5 | SOL | USD | 111.415 | 111.415 | COINBASE | 2/10/2022 6:31 |
| 6 | BTC | USD | 43916.9 | 43916.9 | KRAKEN | 2/10/2022 6:01 |
| 7 | XRP | USD | 0.87188 | 0.87188 | KRAKEN | 2/10/2022 6:01 |
| 8 | BTC | USD | 43819.8 | 43819.8 | KRAKEN | 2/10/2022 6:31 |
| 9 | USD | CAD | 1.268 | 1.268 | BANKOFCANADA | 2/10/2022 5:31 |


In [13]:
dfCurrency = df.loc[df['instrument_ticker'] == 'SOL']

In [14]:
dfCurrency['offer'].describe()

count    11371.000000
mean       143.432384
std         34.494024
min         81.600000
25%        111.790000
50%        143.605000
75%        175.230000
max        235.180000
Name: offer, dtype: float64
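
The summary statistics above already give a feel for what the Mean & Standard Deviation and Boxplots methods would flag on the full SOL series. The sketch below plugs the `describe()` values into the usual cut-off formulas. The multipliers (`k` = 2 or 3 standard deviations, 1.5 times the interquartile range) are common conventions and are assumptions here, since the assignment text does not fix them. The function names are hypothetical.

```python
# Values copied from the describe() output above
mean, std = 143.432384, 34.494024
q1, q3 = 111.79, 175.23
series_min, series_max = 81.6, 235.18


def sigma_bounds(mean, std, k):
    """Lower/upper cut-offs at k standard deviations from the mean."""
    return mean - k * std, mean + k * std


def iqr_fences(q1, q3, k=1.5):
    """Lower/upper boxplot fences at k times the interquartile range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outside(value, bounds):
    """True when value lies outside the (lower, upper) bounds."""
    lower, upper = bounds
    return value < lower or value > upper


for label, bounds in [
    ("3-sigma", sigma_bounds(mean, std, 3)),
    ("2-sigma", sigma_bounds(mean, std, 2)),
    ("1.5*IQR", iqr_fences(q1, q3)),
]:
    print(label, [round(b, 2) for b in bounds],
          "min outside:", outside(series_min, bounds),
          "max outside:", outside(series_max, bounds))
```

With these numbers, the 3-sigma bounds are roughly 39.95 to 246.91 and the 1.5*IQR fences roughly 16.63 to 270.39, so neither the minimum (81.6) nor the maximum (235.18) is flagged. Under a 2-sigma rule the upper bound drops to about 212.42 and the maximum is flagged. These figures describe all 11,371 SOL rows. A random 100-element sample has its own mean, deviation, and quartiles, so the flagged values can differ from run to run.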