# Where's the Best Place to Sell Books Online?
In 2018, e-commerce startup Flipkart was acquired for $18B by Wal-Mart.
It had taken over the online consumer market in India, and beat out the world's largest online business, Amazon.

It is commonly believed that startups can beat out larger businesses through differentiation, or being unique. Can a machine show us the differences between books being sold on Amazon, vs. Flipkart?

# Setup for Data Analysis and ML Classifications

In [3]:
# Exploratory Data Analysis libraries
import pandas as pd
import numpy as np
# visualizations
import seaborn as sns 
import matplotlib.pyplot as plt  
%matplotlib inline  # allows Matplotlib graphs to run inline in notebook

# Machine Learning libraries 
import sklearn
from sklearn.model_selection import train_test_split  # function used in building ML models

### Instantiating pandas DataFrames

In [4]:
amazon_df = pd.read_csv('Data/amazon.csv')
fp_df = pd.read_csv('Data/flipkart.csv')
# View the data for book listings on Amazon
amazon_df.head()

Unnamed: 0,amazon_title,amazon_author,amazon_rating,amazon_reviews count,amazon_isbn-10,amazon_price
0,Tell Me your Dreams,by Sidney Sheldon,4.4,160.0,8172234902,209
1,The Boy in the Striped Pyjamas (Definitions),by John Boyne,4.6,134.0,1862305277,350
2,Romancing the Balance Sheet: For Anyone Who Ow...,by Anil Lamba,4.5,156.0,9350294311,477
3,Mossad,by Michael Bar-Zohar - Import,4.6,637.0,8184958455,340
4,My Story,by Kamala Das,4.5,42.0,8172238975,178


In [5]:
# do the same for Flipkart books sold
fp_df.head()

Unnamed: 0,flipkart_author,flipkart_isbn10,flipkart_title,flipkart_ratings count,flipkart_price,flipkart_stars
0,Sidney Sheldon,8172234902,TELL ME YOUR DREAMS,902,209,4.5
1,,1862305277,The Boy in the Striped Pyjamas,83,372,4.5
2,Anil Lamba,9350294311,ROMANCING THE BALANCE SHEET,352,477,4.5
3,Bar-Zohar Michael,8184958455,Mossad,560,280,4.5
4,Kamala Das,8172238975,MY STORY,322,178,4.3


### Fixing Data Inconsistencies

Several problems with the data have been identified, as it pertains to this analysis:

1. Columns are named differently across DataFrames
2. The columns are ordered differently across DataFrames
3. Books with the same title have different casing across DataFrames
4. Data type for ratings count is all floats on Amazon data, yet integers for the analogous column on the Flipkart Dataframe.
5. ISBN columns offers seeminly useless information.
6. Different kinds of NaN values exist in the author columns on both DataFrames (i.e. "By NA", "Not Available", or just leaving the cell empty).

In [23]:
def cutoff_prefix(df, prefix_length):
    '''Removes the first prefix_length characters from each column in a DataFrame.'''
    col_names = list(df.columns)
    for i in range(len(col_names)):
        # slicing off the first 7 letters
        label = col_names[i]
        col_names[i] = label[prefix_length:]
        # swap out the names in the df for their shortened versions
        df.rename(columns={df.columns[i]: col_names[i]})
    return df

cutoff_prefix(amazon_df, 7)

Unnamed: 0,amazon_title,amazon_author,amazon_rating,amazon_reviews count,amazon_isbn-10,amazon_price
0,Tell Me your Dreams,by Sidney Sheldon,4.4,160.0,8172234902,209
1,The Boy in the Striped Pyjamas (Definitions),by John Boyne,4.6,134.0,1862305277,350
2,Romancing the Balance Sheet: For Anyone Who Ow...,by Anil Lamba,4.5,156.0,9350294311,477
3,Mossad,by Michael Bar-Zohar - Import,4.6,637.0,8184958455,340
4,My Story,by Kamala Das,4.5,42.0,8172238975,178
...,...,...,...,...,...,...
1377,Geronimo Stilton and the Kingdom of Fantasy #8...,by Geronimo Stilton,4.6,329.0,9385887815,274
1378,Harry Potter and the Deathly Hallows (Harry Po...,by J.K. Rowling,4.6,352.0,1408855712,666
1379,Sita's Ramayana,by Samhita Arni,3.6,5.0,9380340036,542
1380,The Maze Runner #02 Scorch Trials Movie Tie-in,by James Dashner,3.9,56.0,9351039684,247


TypeError: Index does not support mutable operations