## Data Import & EDA

In [1]:
import pandas as pd
import os
from functools import reduce

In [2]:
data = pd.read_csv('/Users/ariana/Desktop/investments_VC.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49438 entries, 0 to 49437
Data columns (total 39 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   permalink             49438 non-null  object 
 1   name                  49437 non-null  object 
 2   homepage_url          45989 non-null  object 
 3   category_list         45477 non-null  object 
 4    market               45470 non-null  object 
 5    funding_total_usd    49438 non-null  object 
 6   status                48124 non-null  object 
 7   country_code          44165 non-null  object 
 8   state_code            30161 non-null  object 
 9   region                44165 non-null  object 
 10  city                  43322 non-null  object 
 11  funding_rounds        49438 non-null  int64  
 12  founded_at            38554 non-null  object 
 13  founded_month         38482 non-null  object 
 14  founded_quarter       38482 non-null  object 
 15  founded_year       

In [4]:
data.head()

Unnamed: 0,permalink,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,...,secondary_market,product_crowdfunding,round_A,round_B,round_C,round_D,round_E,round_F,round_G,round_H
0,/organization/waywire,#waywire,http://www.waywire.com,|Entertainment|Politics|Social Media|News|,News,1750000,acquired,USA,NY,New York City,...,0,0,0,0,0,0,0,0,0,0
1,/organization/tv-communications,&TV Communications,http://enjoyandtv.com,|Games|,Games,4000000,operating,USA,CA,Los Angeles,...,0,0,0,0,0,0,0,0,0,0
2,/organization/rock-your-paper,'Rock' Your Paper,http://www.rockyourpaper.org,|Publishing|Education|,Publishing,40000,operating,EST,,Tallinn,...,0,0,0,0,0,0,0,0,0,0
3,/organization/in-touch-network,(In)Touch Network,http://www.InTouchNetwork.com,|Electronics|Guides|Coffee|Restaurants|Music|i...,Electronics,1500000,operating,GBR,,London,...,0,0,0,0,0,0,0,0,0,0
4,/organization/r-ranch-and-mine,-R- Ranch and Mine,,|Tourism|Entertainment|Games|,Tourism,60000,operating,USA,TX,Dallas,...,0,0,0,0,0,0,0,0,0,0


The dataset has 49,438 rows and 39 columns describing companies: funding rounds, status, industry market, location, and more. Two column names (`market` and `funding_total_usd`) carry stray whitespace in the raw file, which the preprocessing step strips. The columns most likely to be useful for an investor-facing recommendation system are:

1. name: The name of the company.
2. market: The industry or market the company operates in.
3. funding_total_usd: Total funding received, in USD.
4. status: Current status of the company (e.g., operating, acquired).
5. country_code: The country the company is based in.
6. funding_rounds: Number of funding rounds.
7. founded_at: The date the company was founded.
8. Specific funding round columns (seed, venture, equity_crowdfunding, etc.).


### Goal: Create a Content-Based Recommendation System

A content-based system recommends companies that are similar to ones an investor has already shown interest in. Similarity can be computed from features such as market, number of funding rounds, and total funding.

Steps:

1. Data Preprocessing: Handle missing values, convert data types appropriately (especially `funding_total_usd`, which is stored as a string), and possibly reduce the number of features based on relevance.
2. Feature Engineering: Create a profile for each company using relevant features. We may need to encode categorical data (like country and market) and normalize numerical values for better similarity computation.
3. Similarity Metric: Decide on a similarity metric (e.g., cosine similarity, Euclidean distance) to compare companies.
4. Recommendation Engine: Build the engine that computes similarity scores between companies and suggests similar companies based on a given input company.
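
The choice in step 3 matters because the two metrics answer different questions. The sketch below uses made-up two-feature profiles (the variable names and values are hypothetical, not taken from the dataset) to show that cosine similarity ignores magnitude while Euclidean distance does not.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return float(np.linalg.norm(a - b))

# Hypothetical profiles: [total funding, funding rounds], arbitrary units
small = np.array([1.0, 1.0])
large_same_mix = np.array([10.0, 10.0])   # same proportions, 10x the scale
different_mix = np.array([1.0, 3.0])      # similar scale, different proportions

cos_scaled = cosine_sim(small, large_same_mix)
dist_scaled = euclidean_dist(small, large_same_mix)
cos_mixed = cosine_sim(small, different_mix)
dist_mixed = euclidean_dist(small, different_mix)

print("scaled copy :", cos_scaled, dist_scaled)
print("different mix:", cos_mixed, dist_mixed)
```

Cosine similarity treats the scaled copy as a perfect match, whereas Euclidean distance considers it far away. Which behavior is preferable depends on whether company size should count toward similarity.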

## Data Preprocessing

In [5]:
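#Strip leading/trailing spaces from column names (e.g. ' market ' -> 'market')
data.columns = data.columns.str.strip()

#Parse funding as float: drop thousands separators; '-' marks an unknown amount and is treated as 0
data['funding_total_usd'] = (
    data['funding_total_usd'].str.strip()
    .str.replace(',', '', regex=False)
    .str.replace('-', '0', regex=False)
    .astype(float)
)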

#Fill missing values
data['market'] = data['market'].fillna('Unknown')

#Drop rows with missing 'country_code'
data = data.dropna(subset=['country_code'])

#Normalize numerical data
data['funding_rounds'] = (data['funding_rounds'] - data['funding_rounds'].mean()) / data['funding_rounds'].std()
data['funding_total_usd'] = (data['funding_total_usd'] - data['funding_total_usd'].mean()) / data['funding_total_usd'].std()

#Convert to categorical codes (arbitrary integer labels; their numeric order carries no meaning)
data['country_code'] = data['country_code'].astype('category').cat.codes
data['market'] = data['market'].astype('category').cat.codes


data[['market', 'funding_total_usd', 'funding_rounds', 'country_code']].head(), data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44165 entries, 0 to 49437
Data columns (total 39 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   permalink             44165 non-null  object 
 1   name                  44165 non-null  object 
 2   homepage_url          41424 non-null  object 
 3   category_list         41332 non-null  object 
 4   market                44165 non-null  int16  
 5   funding_total_usd     44165 non-null  float64
 6   status                43057 non-null  object 
 7   country_code          44165 non-null  int8   
 8   state_code            30161 non-null  object 
 9   region                44165 non-null  object 
 10  city                  43322 non-null  object 
 11  funding_rounds        44165 non-null  float64
 12  founded_at            35519 non-null  object 
 13  founded_month         35451 non-null  object 
 14  founded_quarter       35451 non-null  object 
 15  founded_year          35

(   market  funding_total_usd  funding_rounds  country_code
 0     447          -0.076622       -0.560182           110
 1     264          -0.062747        0.186496           110
 2     524          -0.087166       -0.560182            35
 3     201          -0.078163       -0.560182            38
 4     660          -0.087043        0.186496           110,
 None)
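
Integer category codes such as `market = 447` make a distance metric treat code 447 as "closer" to 448 than to 264, even though the codes are arbitrary. One common alternative is one-hot encoding. The sketch below applies `pd.get_dummies` to a small made-up frame whose column names mirror the notebook's; the rows themselves are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the notebook
toy = pd.DataFrame({
    "market": ["News", "Games", "News", "Unknown"],
    "country_code": ["USA", "USA", "EST", "GBR"],
    "funding_rounds": [1, 2, 1, 3],
})

# One indicator column per category value; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["market", "country_code"], dtype=float)

print(encoded.columns.tolist())
print(encoded.shape)
```

With one-hot columns, two companies in different markets are equally dissimilar regardless of which markets they are. The trade-off is width: the real `market` column has several hundred distinct values, so the encoded matrix would be correspondingly wide.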

## Content-Based Recommendation Algorithm

In [6]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

#Feature matrix for similarity calculation
feature_matrix = data[['market', 'funding_total_usd', 'funding_rounds', 'country_code']].values

#Cosine similarity matrix (n x n; with ~44k rows this needs roughly 15 GB as float64)
similarity_matrix = cosine_similarity(feature_matrix)
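
If the full pairwise matrix does not fit in memory, similarities can be computed for one query row at a time instead. The following is a minimal NumPy sketch of that idea, not the notebook's method; `top_similar_rows` and the `demo` array are hypothetical names introduced for illustration.

```python
import numpy as np

def top_similar_rows(features, query_pos, top_n=5):
    """Return (positions, scores) of the top_n rows most cosine-similar to one row."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1)
    norms[norms == 0] = 1.0                  # avoid division by zero for all-zero rows
    unit = features / norms[:, None]         # scale every row to unit length
    scores = unit @ unit[query_pos]          # cosine similarity to the query row only
    scores[query_pos] = -np.inf              # exclude the query itself
    order = np.argsort(-scores)[:top_n]      # highest similarity first
    return order.tolist(), scores[order].tolist()

# Tiny made-up feature matrix: rows are companies, columns are features
demo = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
positions, scores = top_similar_rows(demo, query_pos=0, top_n=2)
print(positions, scores)
```

This keeps memory proportional to the number of rows rather than its square, at the cost of recomputing one matrix-vector product per query.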

In [None]:
#Function to recommend companies based on similarity scores
def recommend_companies(company_name, data, similarity_matrix, top_n=5):
    #Check if the company exists
    matches = np.flatnonzero(data['name'].values == company_name)
    if len(matches) == 0:
        return "Company not found in the dataset."

    #Row position of the company (not the index label: rows were dropped
    #earlier, so labels no longer line up with rows of similarity_matrix)
    position = matches[0]

    #Similarity of this company to every company, sorted in descending order
    scores = similarity_matrix[position]
    ranked_positions = np.argsort(-scores)

    #Keep the top 'n' positions, explicitly excluding the company itself
    top_positions = [p for p in ranked_positions if p != position][:top_n]

    #Get the names of the top similar companies
    return data.iloc[top_positions]['name'].tolist()

#Test the recommendation system with an example company from the dataset
test_company = data['name'].iloc[0]
recommend_companies(test_company, data, similarity_matrix)


In [None]:
test_company

## Clustering

Clustering can be used to segment companies into similar groups based on their features. Investors can be recommended companies from clusters that have historically seen success or growth. 

**Step 1: Determining the Optimal Number of Clusters**

Use the elbow method to determine the optimal number of clusters by plotting the sum of squared distances from each point to its assigned center as we change the number of clusters.

**Step 2: K-Means Clustering**

Apply the K-means algorithm to segment the companies.

**Step 3: Analyze Cluster Output**

Examine the characteristics of each cluster to understand what types of companies are grouped together.
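
The quantity plotted in Step 1 can be computed by hand, which helps when reading the elbow curve. The sketch below uses four made-up points and hand-picked centers (all hypothetical) and shows the sum of squared distances dropping sharply once the number of centers matches the number of natural groups. scikit-learn's `KMeans` exposes the same quantity as `inertia_`.

```python
import numpy as np

def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))

# Two obvious groups: one near x = 0, one near x = 10
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: a single center at the overall mean
labels_1 = np.zeros(len(points), dtype=int)
centers_1 = points.mean(axis=0, keepdims=True)

# k = 2: one center per group, each at its group's mean
labels_2 = np.array([0, 0, 1, 1])
centers_2 = np.array([points[labels_2 == j].mean(axis=0) for j in range(2)])

sse_1 = sse(points, centers_1, labels_1)
sse_2 = sse(points, centers_2, labels_2)
print(sse_1, sse_2)
```

The "elbow" is the value of k after which additional clusters yield only small further reductions.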

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

#Elbow Method
sse = [] 
range_clusters = range(1, 11)

for k in range_clusters:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  #n_init pinned so results do not depend on the library's default
    kmeans.fit(feature_matrix)
    sse.append(kmeans.inertia_)

# Plot the SSE to find the elbow point
plt.figure(figsize=(10, 6))
plt.plot(range_clusters, sse, marker='o')
plt.title('Elbow Method to Determine Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of Squared Errors')
plt.grid(True)
plt.show()

In [None]:
#Applying K-means clustering with 4 clusters
kmeans_final = KMeans(n_clusters=4, n_init=10, random_state=42)  #n_init pinned so results do not depend on the library's default
kmeans_final.fit(feature_matrix)


data['cluster'] = kmeans_final.labels_

#Analyzing the characteristics of each cluster
def most_common(series):
    #Most frequent value within the group
    return series.value_counts().index[0]

#A named function gives the result column a readable name ('most_common')
cluster_summary = data.groupby('cluster').agg({
    'market': ['count', most_common],
    'funding_total_usd': ['mean', 'std'],
    'funding_rounds': ['mean', 'std'],
    'country_code': [most_common]
})

cluster_summary

## Neural Network Model

A simple neural network could serve two purposes here:

1. Predict a company success metric (such as total funding or another growth indicator) from its features, so that companies predicted to perform well can be recommended.
2. Perform deep collaborative filtering, predicting investor-company affinity from learned embeddings. This requires user preference or investor interaction data, which this dataset does not include.

The cells that follow implement only the first option.

**Step 1: Model Architecture**

Build a simple neural network using TensorFlow/Keras. The model will consist of:
- An input layer.
- A couple of hidden layers.
- An output layer with one neuron (since we're doing regression on total funding).

**Step 2: Training the Model**

Train the model using an appropriate loss function (mean squared error for regression) and evaluate its performance on the test set.

**Step 3: Using the Model for Recommendations**

Use the model's predictions to recommend companies with the highest predicted funding.
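
One caveat for Step 3: ranking companies the model was trained on rewards memorization rather than generalization. The sketch below shows one way to rank by prediction while leaving out training rows. `top_n_by_prediction`, the toy frame, and the prediction values are all hypothetical and stand in for the model output.

```python
import numpy as np
import pandas as pd

def top_n_by_prediction(frame, predictions, n=3, exclude_index=None):
    """Attach predictions to frame and return the n rows with the highest values."""
    ranked = frame.assign(predicted_funding=np.asarray(predictions).ravel())
    if exclude_index is not None:
        ranked = ranked.drop(index=exclude_index)   # e.g. rows used for training
    return ranked.nlargest(n, "predicted_funding")

# Made-up companies and made-up model output with shape (n_rows, 1)
toy = pd.DataFrame({"name": ["A", "B", "C", "D"]}, index=[10, 11, 12, 13])
toy_predictions = np.array([[0.2], [0.9], [0.5], [0.7]])

top = top_n_by_prediction(toy, toy_predictions, n=2, exclude_index=[11])
print(top)
```

In the notebook's terms, `exclude_index` could be `X_train.index`, so that only held-out companies are ranked.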

In [None]:
from sklearn.model_selection import train_test_split

#Feature and target variable setup
X = data[['market', 'funding_rounds', 'country_code']]  #Using these features for prediction
y = data['funding_total_usd']  #Target variable

#Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


In [None]:
%pip install tensorflow

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation='relu', input_shape=(3,)),  
    Dense(64, activation='relu'),
    Dense(1) 
])

model.compile(optimizer='adam', loss='mean_squared_error')


In [None]:
history = model.fit(X_train, y_train, epochs=100, validation_split=0.2)

In [None]:
test_loss = model.evaluate(X_test, y_test)
print('Test Loss:', test_loss)

In [None]:
predicted_funding = model.predict(X) 

In [None]:
data['predicted_funding'] = predicted_funding.flatten()  

In [None]:
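#Rank by predicted funding (in standardized units, since the target was normalized earlier)
recommended_companies = data.sort_values(by='predicted_funding', ascending=False)
top_companies = recommended_companies.head(10)
top_companies[['name', 'predicted_funding']]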