# Project Gutenberg Language Analytics Notebook

### Notebook 03: Market Reach and Language Distribution Analysis

### 1. Initial Language Distribution Analysis

#### 1.1 Basic Language Statistics
```excel
Location: "Language_Analysis"
PivotTable Setup:
- Rows: first_language
- Values: 
  * Count of ID (renamed to "Number of Books")
  * Sum of download_count (renamed to "Total Downloads")
  * Average of download_count (renamed to "Avg Downloads per Book")
  
Sort by: "Number of Books" descending
```

![Language Analysis](../src/Screenshots/Language_Analysis.png)

English (en) clearly dominates the Project Gutenberg collection, with 59,980 titles making up the vast majority of the corpus. French (fr) and Finnish (fi) follow as distant second and third, with 3,881 and 3,162 titles respectively. Average downloads per title tell a different story about popularity and demand, however. English averages 192 downloads per title, while some languages with far fewer titles show much higher averages: Russian (ru) at 1,432, Middle English (enm) at 1,079, and Chinese (zh) at 946. This may point to underserved demand for content in these languages, since the few titles available are downloaded very frequently.

The data also reveals a "long tail" distribution: many languages are represented, but 31 of them have 5 or fewer titles, indicating very limited coverage. Some of these rare-language titles show surprisingly high average downloads (Arabic, for example, has 3,396 downloads for its single title). This suggests potential opportunities for expansion in these language markets, although an average based on one or two titles should be read with caution.
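
The PivotTable in 1.1 can be cross-checked outside Excel. The following is a minimal sketch in plain Python, assuming each book is a dictionary with `first_language` and `download_count` keys as in the sheet. The function name `language_stats` and the sample rows are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

def language_stats(books):
    """Group books by first_language; return rows sorted by title count, descending."""
    totals = defaultdict(lambda: {"titles": 0, "downloads": 0})
    for book in books:
        entry = totals[book["first_language"]]
        entry["titles"] += 1                          # Count of ID
        entry["downloads"] += book["download_count"]  # Sum of download_count
    rows = [
        {
            "language": lang,
            "titles": t["titles"],
            "total_downloads": t["downloads"],
            "avg_downloads": t["downloads"] / t["titles"],  # Average of download_count
        }
        for lang, t in totals.items()
    ]
    return sorted(rows, key=lambda r: r["titles"], reverse=True)

# Hypothetical sample rows, for illustration only
sample_books = [
    {"id": 1, "first_language": "en", "download_count": 100},
    {"id": 2, "first_language": "en", "download_count": 300},
    {"id": 3, "first_language": "fr", "download_count": 50},
    {"id": 4, "first_language": "ru", "download_count": 1400},
    {"id": 5, "first_language": "en", "download_count": 200},
]
stats = language_stats(sample_books)
```

In this toy sample the single Russian title has a far higher average than English, which mirrors the pattern described above on a much smaller scale.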


#### 1.2 Create Language Performance Metrics
```excel
# 'language performance' sheet
Column: Download_Percentile
Formula: =PERCENTRANK.INC($G$2:$G$1000,[@download_count])
Where G is your download_count column
```

This metric uses PERCENTRANK.INC to show where each book's download count stands relative to all other books.
- Values range from 0 to 1, where:
  - 0.90 means the book has more downloads than roughly 90% of the other books
  - 0.50 means it sits in the middle of the distribution
  - 0.10 means roughly 90% of books have more downloads
- It identifies relative success regardless of absolute download numbers, which is particularly useful for comparing books across languages or genres where raw download counts vary significantly.
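
To make the percentile concrete, here is a minimal Python sketch of the calculation for a value that already appears in the range, which is the case when every book is ranked against its own column. It assumes the rank is the number of smaller values divided by (n - 1). It does not reproduce Excel's interpolation for values missing from the range, nor Excel's default truncation to three significant digits. The name `percent_rank_inc` and the sample numbers are hypothetical.

```python
def percent_rank_inc(values, x):
    """Approximate PERCENTRANK.INC for an x that is present in values."""
    if len(values) < 2:
        raise ValueError("need at least two values to rank")
    if x not in values:
        # Excel interpolates in this case; the sketch deliberately does not.
        raise ValueError("x must be one of the values")
    smaller = sum(1 for v in values if v < x)
    return smaller / (len(values) - 1)

# Hypothetical download counts, for illustration only
sample_downloads = [10, 20, 30, 40, 50]
ranks = {d: percent_rank_inc(sample_downloads, d) for d in sample_downloads}
# ranks == {10: 0.0, 20: 0.25, 30: 0.5, 40: 0.75, 50: 1.0}
```

The lowest count always ranks 0 and the highest always ranks 1, which is what the "inclusive" in the function name refers to.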

```excel
Column: Language_Performance_Score
Formula: =IF([@download_count]>AVERAGE($G$2:$G$1000),
    [@download_count]/MAX($G$2:$G$1000)*100,
    [@download_count]/AVERAGE($G$2:$G$1000)*50)
```
The formula has two parts based on whether a book is above or below average downloads:

1. For above-average performers:
```excel
[@download_count]/MAX($G$2:$G$1000)*100
```
This compares the book to the most-downloaded book and shows how close a successful book is to the "best possible" performance. It is on a 0-100 scale, with 100 being the best-performing book.

2. For below-average performers:
```excel
[@download_count]/AVERAGE($G$2:$G$1000)*50
```
Each title in this bracket is compared to the average rather than the maximum, which helps identify books that are underperforming but may still have potential.
The result is scaled to a maximum of 50 to differentiate these titles from high performers.

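I find this dual-scale approach valuable because:
1. It scores below-average books on a 0-50 scale and above-average books on a scale that tops out at 100. The two scales can overlap: an above-average book needs at least half the downloads of the top title to score above 50.
2. It keeps extremely high download counts from compressing the scores of below-average books, because those are measured against the average rather than the maximum
3. It gives more nuanced performance data for lower-performing books
4. It helps identify both "star performers" and "potential improvers"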
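
The behaviour of the two branches is easier to see with numbers. Below is a minimal Python sketch of the `Language_Performance_Score` formula, applied to made-up download counts. The function name `language_performance_score` and the sample values are hypothetical; only the branching logic is taken from the Excel formula.

```python
def language_performance_score(count, all_counts):
    """Mirror the Excel IF: scale by the maximum above average, by the average otherwise."""
    average = sum(all_counts) / len(all_counts)
    if count > average:
        return count / max(all_counts) * 100
    return count / average * 50

# Hypothetical download counts (average 2,000, maximum 10,000)
sample_downloads = [0, 0, 0, 0, 0, 500, 1500, 3000, 5000, 10000]
scores = {c: language_performance_score(c, sample_downloads) for c in sample_downloads}
```

In this sample the book with 1,500 downloads (below average) scores 37.5, while the book with 3,000 downloads (above average) scores about 30, because it is measured against a top title with 10,000 downloads. With heavily skewed data like download counts, a score below 50 therefore does not by itself mean that a book is below average.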

This kind of analysis helps identify which types of books in which languages consistently perform well and surface underappreciated books that might need more marketing, which in turn supports data-driven acquisition decisions.

![Language Performance](../src/Screenshots/Language_Performance.png)

### 2. Multilingual Analysis

#### 2.1 Create Language Combination Metrics
```excel
# 'language combinations' sheet
The setup is as follows:
- Rows: first_language, second_language
- Values:
  * Count of ID (show as % of grand total)
  * Sum of download_count
  * Average of download_count
```
![Language Combinations](../src/Screenshots/Language_Combinations.png)

Based on the data, I can see some interesting patterns. As a reminder, all of these observations are based on the Project Gutenberg corpus, which gathers books that are out of copyright in the US, so the dataset is limited compared to a current publisher's list. However, using this very clean dataset lets me show how I approach analysis in publishing; the goal is to demonstrate the method rather than to produce actionable insights I could leverage in a role.

1. One category I call 'Untapped Opportunities', because these languages have high average downloads with a low percentage of titles:
- Middle English (enm): 1,180 avg downloads, <1% of titles. Only a limited number of Middle English texts have survived, but if a company could produce high-quality Modern English translations alongside the original text, there could be a market among students and academics.
- Korean (ko): 894 avg downloads, <1% of titles. There are likely very few Korean-language texts that were published in the US and are now out of copyright, but with the boom in Korean culture, I can see that translations would be popular with a younger market.
- Frisian (fy): 843 avg downloads, <1% of titles. Frisian is a very niche language with 400,000 speakers globally. Although this seems to have potential, I think that a local publisher or imprint might have more success than a larger imprint.
- Spanish (es): 703 avg downloads, 0.03% of titles
- Chinese (zh): 527 avg downloads, 0.01% of titles. Both Chinese and Spanish are spoken by millions (if not billions) of people across the world, so if a publisher wanted to produce translations for speakers in the US, there might be an untapped market.

2. English Language Pair Performance:
English is the primary language with the highest downloads (11,496,536 total downloads, 192 avg downloads), which I would expect from a US-focused organisation. English as a secondary language alongside other languages shows higher average downloads (459-675 per title) but at a much lower volume, suggesting potential for more works translated into English.

3. European Language Performance:
- Spanish leads with 187 avg downloads
- Portuguese: 134 avg downloads
- German: 124 avg downloads
- French: 122 avg downloads
- Italian: 92 avg downloads
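
The 'Untapped Opportunities' screen in item 1 (high average downloads, low share of titles) can be expressed as a simple filter over the language-combination groups. This is a minimal Python sketch under stated assumptions: the function name `untapped_languages`, both thresholds, and the sample rows are hypothetical. They only illustrate the logic and are not the cut-offs used for the list above.

```python
from collections import defaultdict

def untapped_languages(books, min_avg_downloads, max_title_share):
    """Flag language combinations with a high average and a low share of all titles."""
    grand_total = len(books)
    groups = defaultdict(list)
    for book in books:
        key = (book["first_language"], book.get("second_language"))
        groups[key].append(book["download_count"])
    flagged = []
    for key, counts in groups.items():
        share = len(counts) / grand_total   # Count of ID as a fraction of the grand total
        average = sum(counts) / len(counts)
        if average >= min_avg_downloads and share <= max_title_share:
            flagged.append({"languages": key, "title_share": share, "avg_downloads": average})
    return sorted(flagged, key=lambda r: r["avg_downloads"], reverse=True)

# Hypothetical sample: eight English-only titles, one en/enm title, one French title
sample_books = (
    [{"first_language": "en", "second_language": None, "download_count": 200}] * 8
    + [{"first_language": "en", "second_language": "enm", "download_count": 1200}]
    + [{"first_language": "fr", "second_language": None, "download_count": 100}]
)
flagged = untapped_languages(sample_books, min_avg_downloads=500, max_title_share=0.10)
```

Only the en/enm combination passes both conditions in this sample. French has the same small share but a low average, and English-only has too large a share.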

Therefore, my business recommendations for a larger publisher, based on these findings, would be:
1. Immediate Opportunities:
   - Identify Middle English classics that have no current publisher and for which high-quality translations could be produced (highest avg downloads for a secondary language)
   - Consider East Asian language translations (Korean, Chinese show high engagement)
   - Expand Spanish-language content (high downloads, low current percentage)

2. Strategic Expansion:
   - Focus on English translations of successful foreign works (higher than average downloads)
   - Develop more multilingual editions in European languages (consistent performance)

3. Market Testing:
   - Start with small collections in high-performing but underrepresented languages
   - Monitor performance of multilingual editions in European market combinations

### Next Steps and Recommendations
With an extended dataset and links to Nielsen BookScan, I would:
1. Cross-reference with market size data
2. Analyze seasonal patterns in downloads 
3. Create predictive models for future performance
4. Identify underserved language markets further

### Version Control
- Filename: `ProjectGutenberg_Analysis/excel/language_analysis.xls`
- Changes: Initial cleaning process documented


---
**Note**: This notebook is part of the Project Gutenberg Analysis portfolio project.