# Data Preparation

The Book-Crossing dataset, collected by Cai-Nicolas Ziegler in a 4-week crawl from August to September 2004, was used for the book recommendation system. It consists of three CSV files: Users, Book Ratings, and Books. Because of the size of the files, they were loaded into PostgreSQL for storage and easy access. An initial attempt at saving the files to the database led to an encoding error. In this notebook, we read in the CSV files, rename the columns, and save the files out again. This resolves the encoding error and allows us to save the dataset in our Postgres database.

The SQL file used to create these tables can be found [here](http://localhost:8888/lab/tree/dsi%2FProjects%2Fcapstone%2Fsql_files%2F01_bx_create_tables.sql).

**Data Source:** [University of Freiburg’s Department of Computer Science](http://www2.informatik.uni-freiburg.de/~cziegler/BX/) 

## Content
-  [Users](#Users)
-  [Book Ratings](#Book-Ratings)
-  [Books](#Books)

In [1]:
import pandas as pd

## Users

In order to properly read `BX-Users.csv`, we indicate the encoding as `latin_1`.

In [2]:
users_df = pd.read_csv('../data/BX-CSV-Dump/BX-Users.csv', delimiter=';', encoding='latin_1')
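To illustrate why the encoding argument matters, here is a minimal sketch using a made-up byte string (not taken from `BX-Users.csv`). It assumes the source bytes are Latin-1: a Latin-1 accented character is generally not valid UTF-8, so decoding with the wrong codec fails, while decoding as `latin_1` recovers the text.

```python
# Hypothetical sample value; the bytes are illustrative, not from the dataset.
raw = 'montréal, quebec, canada'.encode('latin_1')

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    # The Latin-1 byte for 'é' (0xE9) followed by 'a' is not valid UTF-8.
    utf8_ok = False

# Decoding with the codec the bytes were written in recovers the text,
# which can then be re-encoded as UTF-8 without loss.
text = raw.decode('latin_1')
utf8_bytes = text.encode('utf-8')

print(utf8_ok)
print(text)
```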

Column names are converted to lowercase with underscores (snake_case) to better align with typical naming conventions.

In [3]:
users_df.columns = ['user_id', 'location', 'age']

In [4]:
users_df.head()

Unnamed: 0,user_id,location,age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


The dataframe consists of three columns:
-  `user_id`: unique value
-  `location`: contains the city, state, and country. Some users have a location that does not follow this format, and others are missing the field entirely.
-  `age`: contains missing values
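The observations above could be checked along these lines. This is only a sketch on a small hypothetical frame shaped like `users_df` (the rows are invented for illustration); it counts missing values per column and flags locations that are missing or do not split into exactly three comma-separated parts.

```python
import pandas as pd

# Hypothetical rows shaped like users_df; the values are illustrative only.
sample_users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'location': ['nyc, new york, usa', 'stockton, california, usa',
                 'berlin, germany', None],
    'age': [None, 18.0, None, 17.0],
})

# Number of missing values per column.
missing_counts = sample_users.isna().sum()

# Count the comma-separated parts in each location; missing locations stay NaN.
n_parts = sample_users['location'].str.split(',').str.len()

# Rows whose location is missing or does not have exactly three parts.
irregular = sample_users[n_parts != 3]

print(missing_counts)
print(irregular)
```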

We save `users_df` back out as a CSV, with `;` as the delimiter to match the original file. The file can now be loaded onto the server.

In [5]:
users_df.to_csv('../data/Cleaned-BX-CSV-Dump/BX-Users.csv', sep=';', index=False)
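As a rough check of what this save step produces, the sketch below writes a hypothetical one-row frame to a temporary file and reads the raw bytes back. `to_csv` writes UTF-8 when no `encoding` is given, which is presumably why re-saving resolves the encoding error; the file name and row here are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical row containing a non-ASCII character.
sample = pd.DataFrame({
    'user_id': [1],
    'location': ['montréal, quebec, canada'],
    'age': [30.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'users_sample.csv')  # illustrative file name
    sample.to_csv(path, sep=';', index=False)  # encoding defaults to UTF-8
    with open(path, 'rb') as f:
        written = f.read()

# The bytes on disk decode cleanly as UTF-8.
decoded = written.decode('utf-8')
header_line = decoded.splitlines()[0]
print(header_line)
```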

## Book Ratings

We repeat the same process as we did for `BX-Users`.

In [6]:
book_ratings_df = pd.read_csv('../data/BX-CSV-Dump/BX-Book-Ratings.csv', delimiter=';', encoding='latin_1')

In [7]:
book_ratings_df.columns = ['user_id', 'isbn', 'book_rating']

In [8]:
book_ratings_df.head()

Unnamed: 0,user_id,isbn,book_rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


The dataframe consists of three columns:
-  `user_id`: value that corresponds to `user_id` in `users_df`
-  `isbn`: unique book identifier
-  `book_rating`: contains either an implicit rating `0` or explicit rating `1-10`
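A minimal sketch of how the two kinds of rating could be separated, using the five preview rows above as sample data. The names `implicit` and `explicit` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# The five rows shown in the preview of book_ratings_df.
sample_ratings = pd.DataFrame({
    'user_id': [276725, 276726, 276727, 276729, 276729],
    'isbn': ['034545104X', '0155061224', '0446520802',
             '052165615X', '0521795028'],
    'book_rating': [0, 5, 0, 3, 6],
})

# A rating of 0 marks an implicit interaction.
implicit = sample_ratings[sample_ratings['book_rating'] == 0]

# Ratings from 1 to 10 are explicit; between() is inclusive on both ends.
explicit = sample_ratings[sample_ratings['book_rating'].between(1, 10)]

print(len(implicit), len(explicit))
```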

In [9]:
book_ratings_df.to_csv('../data/Cleaned-BX-CSV-Dump/BX-Book-Ratings.csv', sep=';', index=False)

## Books

We repeat the same process as before.

In [10]:
books_df = pd.read_csv('../data/BX-CSV-Dump/BX-Books.csv', delimiter=';', encoding='latin_1', low_memory=False)

There were parsing issues when loading the file into this notebook: some rows contained duplicated `;` characters and therefore had more fields than expected. The extra delimiters were removed in vim so that the file could be read in correctly.
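One way to locate such rows before editing them by hand is to count fields per line with the standard `csv` module. The sketch below runs on a hypothetical three-column stand-in for the raw file, not on `BX-Books.csv` itself, and reports the line number and field count of every row that has more fields than the header.

```python
import csv
import io

# Hypothetical stand-in for the raw file; the last row has a doubled ';'.
raw_text = (
    'isbn;book_title;book_author\n'
    '0000000001;Some Title;Some Author\n'
    '0000000002;Another Title;;Another Author\n'
)

reader = csv.reader(io.StringIO(raw_text), delimiter=';', quotechar='"')
header = next(reader)
expected = len(header)

# Collect (line number, field count) for rows with too many fields.
bad_rows = [(reader.line_num, len(row)) for row in reader if len(row) > expected]

print(bad_rows)
```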

In [11]:
books_df.drop(columns='Unnamed: 0', inplace=True)

In [12]:
books_df.head()

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,image_url_s,image_url_m,image_url_l
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


The dataframe consists of eight columns:
-  `isbn`: unique book identifier that can also be found in `book_ratings_df`
-  `book_title`: title of book
-  `book_author`: author of book
-  `year_of_publication`: publication year
-  `publisher`: publisher of book
-  `image_url_s`: link to a small image of book cover
-  `image_url_m`: link to a medium image of book cover
-  `image_url_l`: link to a large image of book cover
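Since `isbn` links this table to `book_ratings_df`, it may be worth checking whether every rated ISBN has a matching book. The following is a sketch on hypothetical miniature frames; the name `orphan_ratings` and the third ISBN are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of books_df and book_ratings_df.
sample_books = pd.DataFrame({'isbn': ['0195153448', '0002005018']})
sample_ratings = pd.DataFrame({
    'user_id': [1, 2, 3],
    'isbn': ['0195153448', '0002005018', '9999999999'],
})

# Flag ratings whose ISBN has no matching row in the books table.
has_book = sample_ratings['isbn'].isin(sample_books['isbn'])
orphan_ratings = sample_ratings[~has_book]

print(orphan_ratings)
```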

In [13]:
books_df.to_csv('../data/Cleaned-BX-CSV-Dump/BX-Books.csv', sep=';', index=False)

### Incorrect Values

When trying to load `BX-Books.csv` onto the server after performing the above steps, type errors were raised in multiple columns. Here we look at the rows that raised an error message.

In [14]:
books_df[books_df['image_url_l'].isna()]

Unnamed: 0,isbn,book_title,book_author,year_of_publication,publisher,image_url_s,image_url_m,image_url_l
209550,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...,http://images.amazon.com/images/P/078946697X.0...,http://images.amazon.com/images/P/078946697X.0...,
220744,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...,http://images.amazon.com/images/P/2070426769.0...,http://images.amazon.com/images/P/2070426769.0...,
221691,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...,http://images.amazon.com/images/P/0789466953.0...,http://images.amazon.com/images/P/0789466953.0...,


In [15]:
print(books_df.loc[209550, 'book_title'])
print(books_df.loc[220744, 'book_title'])
print(books_df.loc[221691, 'book_title'])

DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)\";Michael Teitelbaum"
Peuple du ciel, suivi de 'Les Bergers\";Jean-Marie Gustave Le ClÃ?ÃÂ©zio"
DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)\";James Buckley"


For the three rows above, the `book_author` value was included in the `book_title` column, which shifted every following value one column to the left and left `image_url_l` empty. The extra quotation marks were removed in vim so that the values could be parsed and loaded onto the server correctly.
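The fix in this notebook was made by hand in vim. As an alternative, the same repair could be sketched in pandas. The example below uses a single hypothetical row with placeholder values and assumes the fused title always contains the `\";` separator seen in the printed titles above.

```python
import pandas as pd

cols = ['isbn', 'book_title', 'book_author', 'year_of_publication',
        'publisher', 'image_url_s', 'image_url_m', 'image_url_l']

# Hypothetical row mimicking the problem: the author is fused into the title
# and every later value sits one column too far left. Values are placeholders.
sample = pd.DataFrame([[
    '0000000000', 'Some Title\\";Some Author"', '2000', 'Some Publisher',
    'url_s', 'url_m', 'url_l', None,
]], columns=cols)

fixed = sample.copy()
bad = fixed['image_url_l'].isna()

# Move the displaced values one column to the right
# (book_author..image_url_m into year_of_publication..image_url_l).
fixed.loc[bad, cols[3:]] = fixed.loc[bad, cols[2:-1]].to_numpy()

# Split the fused title on the literal '\";' separator into title and author.
parts = fixed.loc[bad, 'book_title'].str.split('\\";', n=1, expand=True, regex=False)
fixed.loc[bad, 'book_title'] = parts[0]
fixed.loc[bad, 'book_author'] = parts[1].str.rstrip('"')

print(fixed.loc[0].tolist())
```

Rows are selected by a missing `image_url_l`, mirroring the filter used in the cell above. On real data, a row that legitimately lacks a large image URL would also match, so the selection would need a stricter condition.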