## LQ45 STOCK PORTFOLIO SELECTION WITH A CLUSTERING ALGORITHM

The work is done in three stages:
1. Estimating stock prices from historical data
2. Determining stock clusters that help investors build their portfolios
3. Assigning the investor's stock to a cluster, based on the stock code the investor enters

Table of Contents:

  #### 1. Fetching and Loading the Data
          1.1 Data and Feature Description
          1.2 Preparing the Libraries
  #### 2. Exploratory Data Analysis
  #### 3. Regression Model as a Stock Representation
          3.1 Baseline Models for Standardization and Regression
          3.2 Model Parameter Tuning: Linear Regression with Cross-Validation
  #### 4. Regression Metrics
          4.1 MAE
          4.2 RMSE
  #### 5. KNN Classification of Stock Movement
  #### 6. Stock Clustering Model
  #### 7. User Interaction

### 1. Fetching and Loading the Data

#### 1.1 Data and Feature Description

The data is taken from the Indonesia Stock Exchange website
https://www.idx.co.id/data-pasar/ringkasan-perdagangan/ringkasan-saham/
and filtered to the LQ-45 stock group.

All stocks have the following columns (names as used in the code below):

* date - trading date in the format dd-mm-yyyy
* open - opening price on the Indonesia Stock Exchange (BEI), in rupiah (Rp)
* high - highest price reached that day
* low - lowest price reached that day
* close - closing price that day
* volume - number of shares traded
* Name - stock code
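
As a minimal sketch of how this column layout might be checked after loading, the snippet below builds a two-row frame by hand and parses the dd-mm-yyyy dates. The values are made up for illustration, and the lowercase column names are an assumption taken from the code later in this notebook, not from the IDX file itself.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the IDX summary file.
sample = pd.DataFrame({
    "date": ["02-01-2019", "03-01-2019"],
    "open": [765, 770],
    "high": [780, 785],
    "low": [760, 765],
    "close": [770, 780],
    "volume": [1_000_000, 1_200_000],
    "Name": ["ANTM", "ANTM"],
})

# Columns the rest of the notebook relies on (assumed spelling).
expected_columns = ["date", "open", "high", "low", "close", "volume", "Name"]
missing = [c for c in expected_columns if c not in sample.columns]

# Parse day-first dates explicitly so 02-01-2019 is read as 2 January.
sample["date"] = pd.to_datetime(sample["date"], format="%d-%m-%Y")
```

Passing an explicit `format` avoids the month/day ambiguity that automatic date parsing can introduce for dates such as 02-01-2019.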

#### 1.2 Preparing the Libraries

In [1]:
import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

import pandas as pd
from pandas import read_csv

In [2]:
#Read the data
filename = '../input/lq45-ta-ori/LQ45_TA_ORI.csv'
stock = read_csv(filename)
print("***Structure of data with all its features***")
stock.head()

### 2. Exploratory Data Analysis

##### Checking the number of records and features for the sample stock 'ANTM'

In [3]:
ticker_name = 'ANTM'
# Copy the slice so new columns can be added later without a SettingWithCopyWarning
stock_a = stock[stock['Name'] == ticker_name].copy()
stock_a.shape

##### Checking for null values in preparation for data preprocessing

In [4]:
stock.info()

##### Checking the data distribution to understand its variance in preparation for scaling

In [5]:
stock_a.describe()

##### Adding new features that capture the price variation within the day and relative to the previous day. These features are intended to help predict the closing price more accurately and to help model the stock's price volatility.

In [6]:
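# Intraday range as a percentage of the day's low (computed on the ANTM slice only)
stock_a['changeduringday'] = ((stock_a['high'] - stock_a['low'] )/ stock_a['low'])*100

# Absolute change from the previous close as a percentage of today's close
stock_a['changefrompreviousday'] = (abs(stock_a['close'].shift() - stock_a['close'] )/ stock_a['close'])*100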
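
To make the two formulas concrete, here is a small worked sketch on three made-up trading days (the `toy` frame and its prices are illustrative assumptions, not LQ-45 data). It applies the same arithmetic as the cell above.

```python
import pandas as pd

# Three hypothetical trading days for one stock
toy = pd.DataFrame({
    "high": [110.0, 120.0, 105.0],
    "low": [100.0, 100.0, 100.0],
    "close": [105.0, 110.0, 100.0],
})

# (high - low) / low * 100: size of the intraday range relative to the low
toy["changeduringday"] = (toy["high"] - toy["low"]) / toy["low"] * 100

# |previous close - close| / close * 100: shift() moves closes down one row,
# so the first row has no previous close and becomes NaN
toy["changefrompreviousday"] = (toy["close"].shift() - toy["close"]).abs() / toy["close"] * 100
```

The intraday changes come out at roughly 10, 20 and 5 percent. The day-over-day change is NaN on the first row and roughly 4.55 and 10 percent afterwards, which is why an imputation step is needed before fitting a model.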

In [1]:
print("**New features 'changeduringday' and 'changefrompreviousday' were added to the dataset. Note: the first row of 'changefrompreviousday' is always NaN for each stock.")
stock_a.head()

##### Graphical representation of the distribution of one stock

In [8]:
stock_a.hist(bins=50, figsize=(20,15))
plt.show()

##### Variation in the closing price of ANTM

In [9]:
stock_a.plot(kind="line", x="date", y="close", figsize=(15, 10))

##### Building a correlation matrix to find the correlation between the closing price (the target) and the other features

In [10]:
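# Restrict to numeric columns; 'date' and 'Name' cannot be correlated
corr_matrix = stock_a.select_dtypes(include='number').corr()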

In [11]:
corr_matrix["close"].sort_values(ascending=False)

In [12]:
from pandas.plotting import scatter_matrix

attributes = ["high", "low", "open", "changefrompreviousday", "changeduringday", "volume"]

scatter_matrix(stock_a[attributes], figsize=(20, 15))

##### Heat map of the feature correlations

In [13]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
corr = stock_a[["high", "low", "open", "changefrompreviousday", "changeduringday", "volume"]].corr()

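# generate a mask that hides the upper triangle (including the diagonal)
# np.bool was removed from NumPy; the built-in bool is the replacement
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True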

# set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 12))

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
            square=True, 
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax);

### 3. Regression Model as a Stock Representation

##### Split the ANTM data into training and test sets with scikit-learn. The 'date' and 'Name' columns are dropped because they carry little signal for predicting the price. The 'close' column is removed from the input features and used as the target, against which prediction accuracy is measured.

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,Normalizer
X_stock_a = stock_a.drop(['date', 'Name','close'], axis=1)
y_stock_a = stock_a['close']

X_stock_train, X_stock_test, y_stock_train, y_stock_test = train_test_split(X_stock_a, y_stock_a, test_size=0.2, random_state=42)

#### 3.1 Baseline Models for Standardization and Regression

Data normalization and linear regression

In [15]:
# Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler,Normalizer

from sklearn.pipeline import Pipeline

Lr_pipeline_nor = Pipeline([
        ('imputer', SimpleImputer(strategy="median")), # Impute missing values (NaN) with the column median
        ('normalizer',Normalizer()),
        ('lr', LinearRegression())
        
    ])

Lr_pipeline_nor.fit(X_stock_train, y_stock_train)

In [16]:
#Data prep pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
data_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")), # Impute missing values (NaN) with the column median
        ('scaler',StandardScaler())
#        ('normalizer', Normalizer()),
    ])



#### Baseline model with standardization and regression

In [17]:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Data prep and model in one pipeline: impute, standardize, then fit the regression

Lr_pipeline_std = Pipeline([
        ('imputer', SimpleImputer(strategy="median")), # Impute missing values (NaN) with the column median
        ('scaler',StandardScaler()),
        ('lr', LinearRegression())
        
    ])

Lr_pipeline_std.fit(X_stock_train, y_stock_train)
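
The two baseline pipelines differ only in the scaling step, so a small sketch of what each scaler does may help. The matrix `X` below is made up (two price-like columns and one volume-like column) and stands in for the stock features. `Normalizer` rescales every row to unit length, while `StandardScaler` rescales every column to zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Made-up rows: two price-like columns and one volume-like column
X = np.array([
    [100.0, 110.0, 1_000_000.0],
    [200.0, 190.0, 3_000_000.0],
    [300.0, 330.0, 2_000_000.0],
])

row_scaled = Normalizer().fit_transform(X)      # each row divided by its L2 norm
col_scaled = StandardScaler().fit_transform(X)  # each column: mean 0, std 1

row_norms = np.linalg.norm(row_scaled, axis=1)  # all equal to 1
col_means = col_scaled.mean(axis=0)             # all close to 0
col_stds = col_scaled.std(axis=0)               # all equal to 1
```

With row-wise normalization the large volume values dominate each row's norm, so the price columns shrink to nearly zero. Column-wise standardization keeps every feature on a comparable scale, which is one plausible reason to prefer the standardized pipeline for the metrics that follow.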



#### 3.2 Model Parameter Tuning: Linear Regression with Cross-Validation

In [18]:
#1. Evaluate Linear Regression with 10-fold cross-validation
from sklearn.linear_model import LinearRegression
# The sklearn.cross_validation module was removed; cross_val_score lives in model_selection
from sklearn.model_selection import cross_val_score

lin_reg = LinearRegression()

# Keep the prepared matrix in its own variable so the raw training frame is not overwritten
X_stock_train_prepared = data_pipeline.fit_transform(X_stock_train)

scores = cross_val_score(lin_reg, X_stock_train_prepared, y_stock_train, scoring='neg_mean_squared_error', cv=10)


# Metrics - the scores are negated MSE values, so flip the sign before reporting
print(-scores)
print('Mean CV MSE for Linear Regression: ', np.mean(-scores))
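
One possible refinement, sketched below under stated assumptions, is to cross-validate a whole pipeline so that the scaler is re-fitted on the training part of every fold instead of on the full training set. The data here is synthetic (`make_regression`), and the names `X_demo`, `y_demo` and `cv_pipeline` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stock features (not real market data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

cv_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LinearRegression()),
])

# scikit-learn returns negated MSE so that a larger score is always better
neg_mse = cross_val_score(cv_pipeline, X_demo, y_demo, scoring="neg_mean_squared_error", cv=10)

# One RMSE per fold, in the same units as the target
cv_rmse = np.sqrt(-neg_mse)
```

Taking the square root of each fold's MSE puts the cross-validation result in the same units as the RMSE reported in section 4.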

### 4. Regression Metrics

#### 4.1 MAE

In [19]:
from sklearn.metrics import mean_absolute_error


lr_stock_predictions_std = Lr_pipeline_std.predict(X_stock_test)
lr_mae_std = mean_absolute_error(y_stock_test, lr_stock_predictions_std)
print('Lr MAE with standardization', lr_mae_std)

#### 4.2 RMSE

In [20]:
import pandas as pd
import numpy as np

#Predict and report RMSE
from sklearn.metrics import mean_squared_error

lr_stock_predictions_std = Lr_pipeline_std.predict(X_stock_test)
lr_mse_std = mean_squared_error(y_stock_test, lr_stock_predictions_std)
lr_rmse_std = np.sqrt(lr_mse_std)
print('Lr RMSE with standardization', lr_rmse_std)
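
As a worked sketch of what the two metrics measure, the snippet below computes MAE and RMSE by hand on three made-up prices and compares them with the scikit-learn functions used above. The arrays `y_true` and `y_pred` are illustrative values, not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

errors = y_pred - y_true                     # [10, -10, 30]
mae_manual = np.mean(np.abs(errors))         # (10 + 10 + 30) / 3
rmse_manual = np.sqrt(np.mean(errors ** 2))  # sqrt((100 + 100 + 900) / 3)

mae_sklearn = mean_absolute_error(y_true, y_pred)
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
```

Here the MAE is about 16.67 and the RMSE about 19.15. The RMSE is larger because squaring gives the single 30-rupiah miss more weight than the two 10-rupiah misses.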

#### RMSE for all LQ-45 stocks

In [21]:
def allModelsResultForAllStocks():
    """Fit the standardized linear-regression pipeline per ticker and collect the test RMSE."""
    ticker_list = np.unique(stock["Name"])
    best_result_per_ticker = []
    for ticker_name in ticker_list:
        # Copy the slice so the new columns are added without a SettingWithCopyWarning
        stock_a = stock[stock['Name'] == ticker_name].copy()

        # Feature 1: intraday price range as a percentage of the low
        stock_a['changeduringday'] = ((stock_a['high'] - stock_a['low']) / stock_a['low']) * 100

        # Feature 2: absolute change from the previous close, in percent
        stock_a['changefrompreviousday'] = (abs(stock_a['close'].shift() - stock_a['close']) / stock_a['close']) * 100

        X_stock_a = stock_a.drop(['date', 'Name', 'close'], axis=1)
        y_stock_a = stock_a['close']

        X_stock_train, X_stock_test, y_stock_train, y_stock_test = train_test_split(X_stock_a, y_stock_a, test_size=0.2, random_state=42)

        # Missing values (such as the first changefrompreviousday) are imputed inside the pipeline,
        # so no separate imputer call is needed here
        Lr_pipeline_std.fit(X_stock_train, y_stock_train)

        # Predict on the held-out rows and compute the RMSE
        lr_stock_predictions_std = Lr_pipeline_std.predict(X_stock_test)
        lr_mse_std = mean_squared_error(y_stock_test, lr_stock_predictions_std)
        lr_rmse_std = np.sqrt(lr_mse_std)
        best_result_per_ticker.append([ticker_name, 'Lr RMSE with standardization', lr_rmse_std])

    return pd.DataFrame(data=best_result_per_ticker, columns=['Ticker', 'Model', 'RMSE'])

best_result_per_ticker = allModelsResultForAllStocks()

### 5. Stock Classification Model

This is the second part of the project. Here we use the feature coefficients of the linear model for every LQ-45 stock, together with a target class (assigned by homegrown logic), to train and validate a classification model. Once the classification model is ready, it is used to predict the class of the investor's stock, which helps the investor diversify the portfolio.

In [22]:
#Classification function homegrown logic based on stock price mean variation  
def classify (meanValue):
    if meanValue <=1.5:
        return 'Low'
    elif meanValue >1.5 and  meanValue <=2.5:
        return 'Medium'
    elif meanValue >2.5:
        return 'High'
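
The same thresholds can also be applied to many stocks at once. The sketch below uses `pd.cut` as a vectorized alternative to `classify`. It is not the method used in this notebook, and the ticker labels and mean values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mean intraday changes (percent) for four placeholder tickers
mean_changes = pd.Series([0.9, 1.5, 2.0, 3.1], index=["AAAA", "BBBB", "CCCC", "DDDD"])

# right=True (the default) closes each bin on the right, matching classify():
# (-inf, 1.5] -> Low, (1.5, 2.5] -> Medium, (2.5, inf) -> High
classes = pd.cut(
    mean_changes,
    bins=[-np.inf, 1.5, 2.5, np.inf],
    labels=["Low", "Medium", "High"],
)
```

A value of exactly 1.5 falls in the 'Low' bin, just as `meanValue <= 1.5` does in `classify`.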

In [23]:
#function to get linear model for given ticker 

def linearModel(ticker):
    # Copy the slice so the new columns are added without a SettingWithCopyWarning
    stock_a = stock[stock['Name'] == ticker].copy()
    #Adding new features 
    #1 Price movement during day time 
    stock_a['changeduringday'] = ((stock_a['high'] - stock_a['low'] )/ stock_a['low'])*100

    #2 Price movement from the previous close
    stock_a['changefrompreviousday'] = (abs(stock_a['close'].shift() - stock_a['close'] )/ stock_a['close'])*100

    X_stock_a = stock_a.drop(['date', 'Name','close'], axis=1)
    y_stock_a = stock_a['close']

    Lr_pipeline_std.fit(X_stock_a, y_stock_a)
    
    model = Lr_pipeline_std.named_steps['lr']
    
    return model,stock_a

In [24]:

# Use all LQ-45 stocks as training data
ticker_list = np.unique(stock['Name'])

df = pd.DataFrame(columns=['TICKER','CLASS','Coef for open','Coef for high','Coef for low','Coef for volume','Coef for change within day','Coef for change from prev day'])
for ticker in ticker_list:
    
    model,stock_a = linearModel(ticker)    
    
    mean_change = stock_a["changeduringday"].mean()
    print("Mean value:", mean_change)
    # Build the row as a list so the coefficients stay numeric instead of being cast to strings
    stock_features = [ticker, classify(mean_change), *model.coef_]
    
    df.loc[-1] = stock_features  # adding a row
    df.index = df.index + 1  # shifting index
    df = df.sort_index() 
   
#print(df)

# Save the feature coefficients and target class of the LQ-45 stocks
# (default write mode, so re-running the cell does not append duplicate rows)
df.to_csv('coefflq45.csv')
    


#### 5 KNN Classification of Stock Movement

##### Split the data into train and testing using Sklearn

In [25]:
# loading libraries
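import numpy as np
# sklearn.cross_validation was removed; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split

# .ix was removed from pandas; .iloc selects the six coefficient columns by position
X_class = np.array(df.iloc[:, 2:8], dtype=float)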
y_class = np.array(df['CLASS']) 


# split into train and test
X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(X_class, y_class, test_size=0.2, random_state=42)

##### Fitting and predicting with a KNN classifier using the 3 nearest neighbors

In [26]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# instantiate learning model (k = 3)
knn = KNeighborsClassifier(n_neighbors=3)

# fitting the model
knn.fit(X_train_class, y_train_class)

# predict the response
pred = knn.predict(X_test_class)

# evaluate accuracy
print("KNN accuracy", accuracy_score(y_test_class, pred))
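
To illustrate how the three-neighbor vote works, here is a minimal sketch on six hand-made points in two well-separated groups. The points and labels are assumptions for illustration and have nothing to do with the LQ-45 coefficients.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two tight groups of made-up 2-D points
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_demo = np.array(["Low", "Low", "Low", "High", "High", "High"])

knn_demo = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

# Each query takes the majority label among its 3 closest training points
near_origin = knn_demo.predict([[0.2, 0.2]])[0]
near_five = knn_demo.predict([[4.8, 4.9]])[0]

# kneighbors exposes which training rows took part in the vote
distances, indices = knn_demo.kneighbors([[0.2, 0.2]])
```

Because KNN relies on raw distances, features with very different magnitudes (such as a volume coefficient next to price coefficients) can dominate the vote, so scaling the inputs first may be worth considering.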

### 6. Stock Clustering Model

Previously we assigned the classes manually to label the training data and used a classification model to predict the
classes of the test data.
Now we use unsupervised learning (a clustering method) to split the stocks into clusters of similar stocks and
then use the cluster model to predict the cluster of the investor's stock.



#### K-Means++ Model and the Number of Clusters

In [27]:
from sklearn.cluster import KMeans

X_class = np.array(df.iloc[:, 2:8], dtype=float) 	# end index is exclusive
k_mean = KMeans()

# KMeans() is called without arguments, so scikit-learn's default n_clusters=8 is used
k_mean_model = k_mean.fit(X_class)
print("Jumlah Kluster",k_mean_model.n_clusters) 
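
Since `KMeans()` simply falls back to its default cluster count, one common way to choose the number of clusters from the data is to compare silhouette scores over a range of candidates. The sketch below does this on synthetic blobs. It is an illustration under assumed data, not the procedure used in this notebook, and the names `X_demo`, `silhouette_by_k` and `best_k` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (stand-in for the coefficient matrix)
X_demo, _ = make_blobs(
    n_samples=90,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(X_demo)
    # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters
    silhouette_by_k[k] = silhouette_score(X_demo, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
```

On real coefficient data the best score may be far less clear-cut than on these blobs, so the result is better read as a guide than as a definitive answer.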

In [28]:

df_cluster = df.drop(['CLASS'], axis=1)

#Selecting features from dataframe , there are 6 features 
# (.ix was removed from pandas; .iloc selects by position)
X_cluster = np.array(df_cluster.iloc[:, 1:7], dtype=float)

y_pred = k_mean_model.predict(X_cluster)

pred_df = pd.DataFrame({'labels': y_pred, 'companies': df_cluster.iloc[:, 0]})



In [29]:
# Cluster assignment of the LQ-45 stocks
pred_df

### 7. User Interaction


In [39]:
# Enter the chosen stock code

stock_customer = input("Enter the code of the stock you want to buy (the model will assign it to a group): ")
print(stock_customer)
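
The code entered above is used as typed, so a misspelled or lowercase code would select no rows for the linear model. A minimal sketch of a validation step follows. The helper `normalize_ticker` and the ticker list `known` are hypothetical and not part of the original notebook.

```python
import numpy as np

def normalize_ticker(raw_code, known_tickers):
    """Return the cleaned ticker code, or None when it is not in known_tickers."""
    code = raw_code.strip().upper()
    return code if code in set(known_tickers) else None

# Hypothetical ticker list; in the notebook this would be np.unique(stock['Name'])
known = np.array(["ANTM", "BBCA", "TLKM"])

cleaned = normalize_ticker(" antm ", known)   # surrounding spaces and case are ignored
unknown = normalize_ticker("XXXX", known)     # not in the list, so None is returned
```

Checking for `None` before fitting the model lets the notebook ask the investor for another code instead of failing on an empty data slice.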

#### Using the classification model

In [40]:
customer_stock_model,stock_modified = linearModel(stock_customer)
customer_stock_class_pred = knn.predict([customer_stock_model.coef_])

In [41]:
print("Predicted class of the chosen stock", customer_stock_class_pred)

#### Using the clustering model

In [42]:
customer_stock_model,stock_modified = linearModel(stock_customer)

print(customer_stock_model.coef_)

customer_stock_class_pred = k_mean_model.predict([customer_stock_model.coef_])



In [43]:
print("Investor's stock:", stock_customer, "belongs to cluster:", customer_stock_class_pred)