# Part 1: Query Annotation

Identify patterns in the data and create labels to group similar keywords.

- Feel free to edit the spreadsheet as needed (re-sort, add columns etc.)  
- Summarize your findings. This should include  
    - a description of each group  
    - % of total keywords that each group contains  
    - any other notes/observations about the data or your methods for selecting the groups  

## Transforming the data ##
After receiving the xlsx file, I added some cursory column labels. I wanted to use pandas MultiIndexing to index the queries by language and search intent, so I added the language manually using [ISO 2 letter language codes](https://www.sitepoint.com/iso-2-letter-language-codes/) and read the file into this notebook.

In [20]:
import pandas as pd
pd.set_option('display.max_rows', 10)
df = pd.read_excel('THED_ISOCoded.xlsx', sheet_name = 'Part 1 Queries')
df

Unnamed: 0,query,keyword intent,likely user intent,language,translation
0,baixar whatsapp,N,Download Whatsapp,PT,download whatapp
1,options trading beginners,IV,Get more info,EN,
2,stainless steel pan,T,,EN,
3,cheap auto insurance,IV,,EN,
4,chase credit card login,,,EN,
...,...,...,...,...,...
245,新車 軽 自動車 価格,,,JP,
246,plumbing services near me,,,EN,
247,arthritus treatments,,,EN,
248,voyage dinnerware,,,EN,


## Next I create a new data frame, indexing by language. ##
Then I write the new data frame to a new xlsx file. 

In [6]:
# applying the .sort_index function to group and list the indexes alphabetically
multi = df.set_index(['language']).sort_index()
multi

Unnamed: 0_level_0,query,keyword intent,likely user intent,translation
language,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
DE,aktueller heizöl tagespreis,,,
DE,privatverkauf wohnung,,,
DE,autohandel gebrauchtwagen,,,
DE,mieter für wohnung gesucht,,,
DE,steuererklärung formular,,,
...,...,...,...,...
PT,jogar minecraft,,,
PT,pepper trees for sale,,,
PT,geladeira em promoção,,,
PT,transmissão ao vivo futebol,,,


In [19]:
'''
multi.to_excel("THED_LangInd.xlsx")
'''

## I did some quick translations... 
Indexing by language first made it easy for me to quickly translate the non-English queries using a combination of Google Translate, my own abilities, and online research. These translations were helpful for labeling the search intent.

I labeled the search intent myself based on what I thought the dominant intent would be. Then I wrote my progress to a new spreadsheet.

In [7]:
language_index_df = pd.read_excel('THED_LangInd.xlsx')
search_index_df = language_index_df.set_index(['language', 'search intent']).sort_index()
search_index_df

Unnamed: 0_level_0,Unnamed: 1_level_0,query,likely user intent,translation
language,search intent,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
DE,Commercial,privatverkauf wohnung,,private sale apartment
DE,Commercial,wohnungsvermietung privat,,apartment rental private
DE,Informational,aktueller heizöl tagespreis,,Current heating oil daily price
DE,Informational,mieter für wohnung gesucht,,tenant wanted for apartment
DE,Informational,trauertexte für beileidskarten,,mourning texts for sympathy cards
...,...,...,...,...
PT,Navigational,baixar whatsapp,Download Whatsapp,download whatapp
PT,Navigational,meracdo livre brasil,,live market Brazil
PT,Transactional,emitir boleto pagamento mei,,issue mei(Micro Etreprenuer Individual) paymen...
PT,Transactional,jogar minecraft,,play Minecraft


In [35]:
'''
search_index_df.to_excel('THED_SearchInd.xlsx')
'''

## More annotations ##
I decided to add [Google Product Taxonomy](https://developers.google.com/google-ads/api/reference/data/codes-formats#expandable-15) categories, non-exhaustively, in a new column in the spreadsheet before reading the file back into the notebook. Assuming the data is representative of a larger set, these categories could represent the products and services that are most relevant to users' queries.  

I focused especially on Commercial and Transactional search intents. Informational queries required much more interpretation, so in the interest of time I excluded them.

In [21]:
pd.set_option('display.max_colwidth', 0)
product_codes_df = pd.read_excel('THED_SearchInd.xlsx',
                           # the kwarg 'index_col' allows me to retain multiindexing across pandas & spreadsheets
                          index_col=[0,1])
# reapplying .sort_index function to correct and re-sort any categorical errors fixed using the spreadsheet
product_codes_df = product_codes_df.sort_index()
product_codes_df

Unnamed: 0_level_0,Unnamed: 1_level_0,query,Google Ads API product/service category,translation
language,search intent,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
DE,Commercial,privatverkauf wohnung,/Real Estate/Real Estate Listings/Rental Listings/Apartment Rentals,private sale apartment
DE,Commercial,wohnungsvermietung privat,/Real Estate/Real Estate Listings/Rental Listings/Apartment Rentals,apartment rental private
DE,Informational,aktueller heizöl tagespreis,/Business & Industrial/Energy Industry/Oil & Gas/Fuel/Natural Gas,Current heating oil daily price
DE,Informational,mieter für wohnung gesucht,/Real Estate/Real Estate Listings/Rental Listings/Apartment Rentals,tenant wanted for apartment
DE,Informational,trauertexte für beileidskarten,/Occasions & Gifts/Special Occasions/Funerals & Bereavement,mourning texts for sympathy cards
...,...,...,...,...
PT,Navigational,baixar whatsapp,/Internet & Telecom/Internet/Email & Messaging,download whatapp
PT,Navigational,meracdo livre brasil,/Business & Industrial/Business Management/E-Commerce,live market Brazil
PT,Transactional,emitir boleto pagamento mei,/Law & Government/Legal/Legal Forms & Kits,issue mei(Micro Etreprenuer Individual) payment slip
PT,Transactional,jogar minecraft,"/Hobbies & Leisure/Toys & Games/Games/Video Games, Consoles & Accessories/Computer & Video Games",play Minecraft
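
The `index_col=[0,1]` argument is what rebuilds the two-level index when a file is read back in. A minimal sketch of that round trip follows. It assumes a CSV buffer as a stand-in for the xlsx files (so it runs without a spreadsheet engine installed), and the three rows are copied from the table above purely for illustration.

```python
import io

import pandas as pd

# Three rows borrowed from the table above, indexed by language and search intent.
df = pd.DataFrame(
    {
        "language": ["DE", "DE", "PT"],
        "search intent": ["Commercial", "Informational", "Navigational"],
        "query": [
            "privatverkauf wohnung",
            "aktueller heizöl tagespreis",
            "baixar whatsapp",
        ],
    }
).set_index(["language", "search intent"])

# Write to an in-memory CSV (assumption: standing in for to_excel / read_excel).
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# index_col=[0, 1] tells pandas to rebuild the MultiIndex from the first two columns.
restored = pd.read_csv(buffer, index_col=[0, 1])
print(restored.index.names)
```

Without `index_col`, the two index levels would come back as ordinary columns and `set_index` would have to be reapplied, as in the earlier cell that reads `THED_LangInd.xlsx`.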


## Splitting the taxonomies ##
I want to make the document more usable programmatically and in spreadsheets, so I decided to break the taxonomies up by level. I will write a function that iterates over my list of codes and splits each one using '/' as the delimiter, producing a new list of lists.

In [22]:
#filtering out my NaN values
taxonomy_df = product_codes_df[product_codes_df['Google Ads API product/service category'].notnull()]
#creating a list from the filtered product taxonomies
taxonomy_list = taxonomy_df['Google Ads API product/service category'].tolist()
# this is my list of taxonomies
#taxonomy_list

In [11]:
# no_slash is a function that takes a list (l) as an argument
def no_slash(l):
    #return_value is a container for the desired output
    return_value = []
    # iterating over the list
    for i in l:
        # splitting on '/' with str.split and dropping the empty first element left by the leading slash
        return_value.append(i.split('/')[1:])
    # returning the desired output
    return return_value

## Array → Pandas Dataframe ##
Using the no_slash function I create a new list called taxonomy_array. From that list, I create a new dataframe called array_df, setting the column names to levels 1-6 (l1, l2, etc.).

Filtering out the NaN values in taxonomy_list was essential to transforming the product and service codes: NaN is a float, which has no .split method, so calling it raises an AttributeError. Because the filtered rows are dropped, array_df no longer lines up row-for-row with product_codes_df, so I decided to realign this part in the spreadsheet.
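
As an alternative to realigning by hand, the split can be done on the column itself so that rows with no category keep their position. This is a sketch of a different technique from the no_slash function above, using pandas' `Series.str.split` with `expand=True`. The `codes` Series and the generated `L1`, `L2`, ... column names are illustrative assumptions, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the 'Google Ads API product/service category' column,
# including one missing value.
codes = pd.Series(
    [
        "/Real Estate/Real Estate Listings/Rental Listings/Apartment Rentals",
        np.nan,
        "/Occasions & Gifts/Special Occasions/Funerals & Bereavement",
    ],
    name="category",
)

# expand=True returns one column per path segment; NaN rows stay in place as all-NaN.
# The first column holds the empty string left by the leading '/', so it is dropped.
levels = codes.str.split("/", expand=True).iloc[:, 1:]

# Name the columns from the observed depth instead of hard-coding six levels.
levels.columns = [f"L{i}" for i in range(1, levels.shape[1] + 1)]
print(levels)
```

Because the result shares the original index, it could be joined back onto the source frame directly, which avoids the offset described above.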

In [23]:
taxonomy_array = no_slash(taxonomy_list)
array_df = pd.DataFrame(taxonomy_array, columns = ['l1','l2','l3','l4','l5','l6'])
array_df

Unnamed: 0,l1,l2,l3,l4,l5,l6
0,Real Estate,Real Estate Listings,Rental Listings,Apartment Rentals,,
1,Real Estate,Real Estate Listings,Rental Listings,Apartment Rentals,,
2,Business & Industrial,Energy Industry,Oil & Gas,Fuel,Natural Gas,
3,Real Estate,Real Estate Listings,Rental Listings,Apartment Rentals,,
4,Occasions & Gifts,Special Occasions,Funerals & Bereavement,,,
...,...,...,...,...,...,...
150,Internet & Telecom,Internet,Email & Messaging,,,
151,Business & Industrial,Business Management,E-Commerce,,,
152,Law & Government,Legal,Legal Forms & Kits,,,
153,Hobbies & Leisure,Toys & Games,Games,"Video Games, Consoles & Accessories",Computer & Video Games,


In [None]:
'''
array_df.to_excel('array_df.xlsx')
'''

## The final document ##

In [24]:
final_df = pd.read_excel('THED_level_codes.xlsx',
                           # the kwarg 'index_col' allows me to retain multiindexing across pandas & spreadsheets
                          index_col=[0,1])
# reapplying .sort_index function to correct and re-sort any categorical errors fixed using the spreadsheet
final_df = final_df.sort_index()
# filtering to show only the query and the six taxonomy level columns
final_df = final_df.filter(items = ['query', 'L1', 'L2', 'L3', 'L4', 'L5', 'L6'])
final_df

Unnamed: 0_level_0,Unnamed: 1_level_0,query,L1,L2,L3,L4,L5,L6
language,search intent,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
DE,Commercial,privatverkauf wohnung,Real Estate,Real Estate Listings,Rental Listings,Apartment Rentals,,
DE,Commercial,wohnungsvermietung privat,Real Estate,Real Estate Listings,Rental Listings,Apartment Rentals,,
DE,Informational,aktueller heizöl tagespreis,Business & Industrial,Energy Industry,Oil & Gas,Fuel,Natural Gas,
DE,Informational,mieter für wohnung gesucht,Real Estate,Real Estate Listings,Rental Listings,Apartment Rentals,,
DE,Informational,trauertexte für beileidskarten,Occasions & Gifts,Special Occasions,Funerals & Bereavement,,,
...,...,...,...,...,...,...,...,...
PT,Navigational,baixar whatsapp,Internet & Telecom,Internet,Email & Messaging,,,
PT,Navigational,meracdo livre brasil,Business & Industrial,Business Management,E-Commerce,,,
PT,Transactional,emitir boleto pagamento mei,Law & Government,Legal,Legal Forms & Kits,,,
PT,Transactional,jogar minecraft,Hobbies & Leisure,Toys & Games,Games,"Video Games, Consoles & Accessories",Computer & Video Games,


# Part 1 Summary 
Using pandas multiindexing, I organized queries into groups by setting their indexes to language and search intent. I created new columns named l1-l6 that represent the progressively specific levels of their associated Google Product Taxonomy.  

As I wanted to create a usable document to be able to explore the commonalities in the data, I used the document to answer 2 self-imposed questions.  

*What is the top search intent for each language?* **and** *What is the top L1 (highest level Google Product Taxonomy) for each language?*

## Language
**81.2%** of queries were in **English**  
**5.2%** were in **Spanish**  
**4%** were in **German**  
**4%** were in **Japanese**  
**2.8%** were in **Portuguese**  
and another **2.8%** of queries were of **undeterminable language** by query alone.

## Search Intent 
**37.6%** of queries were of **Commercial** search intent    
**37.6%** were **Informational**    
**9.6%** were **Navigational**  
**14.4%** were **Transactional**  
and **0.8%** **needed more information** to identify.
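
The overall percentages above can be reproduced from the per-group counts. The sketch below assumes the counts are typed in by hand from the `groupby(...).size()` output shown later in this notebook, rather than read from `THED_level_codes.xlsx`.

```python
import pandas as pd

# Counts copied from the groupby(level=[0, 1]).size() output in this notebook.
counts = pd.Series(
    {
        ("DE", "Commercial"): 2, ("DE", "Informational"): 4, ("DE", "Navigational"): 4,
        ("EN", "Commercial"): 85, ("EN", "Informational"): 79, ("EN", "NEI"): 1,
        ("EN", "Navigational"): 11, ("EN", "Transactional"): 27,
        ("ES", "Commercial"): 1, ("ES", "Informational"): 4,
        ("ES", "Navigational"): 3, ("ES", "Transactional"): 5,
        ("JA", "Commercial"): 5, ("JA", "Informational"): 1,
        ("JA", "Navigational"): 3, ("JA", "Transactional"): 1,
        ("NEI", "Informational"): 6, ("NEI", "NEI"): 1,
        ("PT", "Commercial"): 1, ("PT", "Navigational"): 3, ("PT", "Transactional"): 3,
    }
).rename_axis(["language", "search intent"])

total = counts.sum()

# Share of all queries per language and per search intent, as percentages.
language_share = (counts.groupby(level="language").sum() / total * 100).round(1)
intent_share = (counts.groupby(level="search intent").sum() / total * 100).round(1)

print(language_share)
print(intent_share)
```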

## Top Search Intents by Language
In the cells below I used filtering and some built-in functions to find the top search intent by language.  

English: Commercial intent comprised **41.9%** of queries.  
Spanish: Transactional intent comprised **38.5%** of queries.  
German: Informational & Navigational each comprised **40%** of queries.  
Japanese: Commercial intent comprised **50%** of queries.  
Portuguese: Navigational & Transactional intent each comprised **42.9%** of queries.  


In [364]:
# the kwarg 'level' allows me to apply the .size() function to the indexes to find the most frequently appearing search intent
# for each language
final_df.groupby(level=[0,1]).size()

language  search intent
DE        Commercial       2 
          Informational    4 
          Navigational     4 
EN        Commercial       85
          Informational    79
          NEI              1 
          Navigational     11
          Transactional    27
ES        Commercial       1 
          Informational    4 
          Navigational     3 
          Transactional    5 
JA        Commercial       5 
          Informational    1 
          Navigational     3 
          Transactional    1 
NEI       Informational    6 
          NEI              1 
PT        Commercial       1 
          Navigational     3 
          Transactional    3 
dtype: int64

In [439]:
# Pandas' .xs() function takes multiindex labels as an argument and filters the dataframe by those labels
# Each expression represents the amount of queries containing the top search intent divided by the total amount 
# of queries in that language
print(
'\n English: ', len(final_df.xs(('EN', 'Commercial'))) / len(final_df.xs('EN')),
'\n Spanish: ', len(final_df.xs(('ES', 'Transactional'))) / len(final_df.xs('ES')),
'\n German: ', len(final_df.xs(('DE', 'Informational'))) / len(final_df.xs('DE')),
'\n Japanese: ', len(final_df.xs(('JA', 'Commercial'))) / len(final_df.xs('JA')),
'\n Portuguese: ', len(final_df.xs(('PT', 'Transactional'))) / len(final_df.xs('PT')))


 English:  0.4187192118226601 
 Spanish:  0.38461538461538464 
 German:  0.4 
 Japanese:  0.5 
 Portuguese:  0.42857142857142855
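
The cell above names each top intent by hand, which hides ties such as the German and Portuguese ones. A sketch of one way to let pandas find the top intent, ties included, is shown below. The `toy` frame and its `q1`...`q7` queries are hypothetical placeholders, not rows from `final_df`.

```python
import pandas as pd

# Hypothetical rows: DE has a two-way tie, EN has a clear winner.
toy = (
    pd.DataFrame(
        {
            "language": ["DE", "DE", "DE", "DE", "EN", "EN", "EN"],
            "search intent": [
                "Informational", "Navigational", "Informational", "Navigational",
                "Commercial", "Commercial", "Transactional",
            ],
            "query": ["q1", "q2", "q3", "q4", "q5", "q6", "q7"],
        }
    )
    .set_index(["language", "search intent"])
    .sort_index()
)

# Number of queries per (language, intent) pair.
sizes = toy.groupby(level=["language", "search intent"]).size()

# Share of each intent within its own language.
share = sizes / sizes.groupby(level="language").transform("sum")

# Keep every intent whose count equals the maximum for its language, so ties survive.
top = share[sizes == sizes.groupby(level="language").transform("max")]
print(top)
```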


## Top L1 by Language

English: Finance  
Spanish: Law & Government   
German: Real Estate  
Japanese: Internet & Telecom  
Portuguese: Apparel 

In [16]:
# Filtering final_df by each language.
german = final_df.xs('DE')
english = final_df.xs('EN')
spanish = final_df.xs('ES')
japanese = final_df.xs('JA')
portuguese = final_df.xs('PT')

# applying the pandas methods .value_counts() and .idxmax() to find the most frequently appearing string in the 'L1' column.
print('Top L1 per language: ' +
'\n English: ' + english['L1'].value_counts().idxmax(),
'\n Spanish: ' + spanish['L1'].value_counts().idxmax(),
'\n German: ' + german['L1'].value_counts().idxmax(),
'\n Japanese: ' + japanese['L1'].value_counts().idxmax(),
'\n Portuguese: ' + portuguese['L1'].value_counts().idxmax())

Top L1 per language: 
 English: Finance 
 Spanish: Law & Government 
 German: Real Estate 
 Japanese: Internet & Telecom 
 Portuguese: Apparel
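
Note that `.value_counts().idxmax()` returns a single label even when two categories are tied. The sketch below shows one way to surface ties using `Series.mode()`, which returns every most-frequent value. The `toy` frame is hypothetical and only imitates the shape of `final_df`.

```python
import pandas as pd

# Hypothetical L1 labels: DE has a clear winner, PT has a tie.
toy = pd.DataFrame(
    {
        "language": ["DE", "DE", "DE", "PT", "PT"],
        "L1": ["Real Estate", "Real Estate", "Finance", "Apparel", "Finance"],
    }
).set_index("language")

# mode() ignores NaN and returns all tied values in sorted order; join them for display.
top_l1 = toy.groupby(level="language")["L1"].agg(lambda s: ", ".join(s.mode()))
print(top_l1)
```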


_______________________
# Part 2: Result Quality Evaluation #
Create a rubric to score the pairs of queries and article titles.  

Not all bad matches are created equally. Make sure that your rubric is granular enough to distinguish
between multiple levels of “bad”. Label the pairs according to your scoring rubric and summarize the
evaluation results.

## Scoring Rubric ##
I created this scoring rubric to classify these pairs by how closely the query and user intent relate to the title of the article. I also took into account how likely a user is to be satisfied by the information in the article, or whether they would have to look elsewhere for more relevant information.  

**Very Poor** - The article is irrelevant, or possibly matched only on a non-topic keyword or synonym. A user would likely skip this article.  
**Poor** - Article title is only slightly related to the keyword. A user would likely skip this article.  
**Less Acceptable** - The article offers broader or more specific information than queried. A user may or may not click this article.  
**Acceptable** - Article title fits the topic of the keyword and satisfies user intent. This rating is for an article where the queried information exists in the body text, but requires more effort to find.      
**Good** - Article satisfies user intent and fits the topic. This rating is for an immediately satisfying article.   

In [178]:
results_df = pd.read_excel('take-home_eval-data.xlsx', sheet_name = 'Part 2 Results')
# filtering to exclude notes I took on the spreadsheet for reference
results_filtered_df = results_df.filter(items = ['keyword', 'title', 'rating'])
# sorting alphabetically by rating to group like ratings for readability
results_filtered_df.sort_values(by=['rating'])

Unnamed: 0,keyword,title,rating
31,distance learning mba online,10 of the Best MBA Programs,Acceptable
12,ninja blender,The Best Blenders for Home Use,Acceptable
6,what should my sugar level be,What Is a Blood Glucose Level?,Acceptable
34,single moms grants,15 Helpful Financial Aid Grants for Single Mothers,Good
8,best cruise packages,10 Luxury Cruise Packages for Your Next Vacation,Less Acceptable
19,social media marketing,15 Businesses With a Great Social Media Presence,Less Acceptable
9,transfer money from australia to new zealand,Best Platforms for Transferring Money Internationally,Less Acceptable
16,custom chef hat,The Best Cookware for Every Cook,Less Acceptable
5,moving overseas shipping,How to Ship a Vehicle,Less Acceptable
25,travel jewelry case,Best Luggage for All Your Travels,Less Acceptable


# Part 2 Summary #
### Relevant Overall Group ###
The Relevant Overall Group contains the scores **Less Acceptable**, **Acceptable**, and **Good**. This group contained **32.4%** of the keyword and title pairs. On the Less Acceptable end of the scale, a user is only *somewhat* likely to click the article. The higher the score, the more likely a user would be to click that article and have it serve their intent.  

Matches with scores in the Relevant Overall Group should be fairly likely to be clicked on.  

### Irrelevant Group ###
The Irrelevant Group contained the **Very Poor** and **Poor** scores. This group contained **67.6%** of the keyword and title pairs. This group is distinguished from the Relevant Overall Group by the high likelihood of users skipping these articles entirely.  

**Matches in the Irrelevant Group risk users having a frustrating search experience by showing low quality results.** They would be unlikely to be clicked on. However, these matches could be insightful for editorial topic generation. 
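
Sorting the ratings alphabetically, as in the cell above, groups like ratings together but does not order them from worst to best. A sketch of one way to encode the rubric's order and derive the two groups is shown below. The `ratings` values are hypothetical and the bucket labels simply reuse the group names from this summary.

```python
import pandas as pd

# Rubric levels from worst to best.
RATING_ORDER = ["Very Poor", "Poor", "Less Acceptable", "Acceptable", "Good"]

# Hypothetical ratings standing in for the 'rating' column.
ratings = pd.Series(
    ["Acceptable", "Very Poor", "Poor", "Good", "Less Acceptable", "Poor", "Very Poor"],
    name="rating",
)

# An ordered categorical sorts by rubric level instead of alphabetically.
ordered = ratings.astype(pd.CategoricalDtype(RATING_ORDER, ordered=True))
sorted_ratings = ordered.sort_values()

# Everything at or above 'Less Acceptable' falls in the Relevant Overall Group.
is_relevant = ordered >= "Less Acceptable"
bucket = is_relevant.map({True: "Relevant Overall", False: "Irrelevant"})

# Share of pairs in each group.
bucket_share = bucket.value_counts(normalize=True)
print(bucket_share)
```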

________________________________________________
# Part 3: Translation Evaluation #
 Rate the quality of the translations.  

## Japanese text translated to English

* Do the English translations capture the topic and user intent? 

In [166]:
translations_df = pd.read_excel('take-home_eval-data.xlsx', sheet_name = 'Part 3 Translations')
translations_df.iloc[:18]

Unnamed: 0,input,translation,captures topic,captures user intent
0,高 性能 ノート パソコン 激安,High -performance laptops discount,True,True
1,太陽 光 発電 ソーラー パネル,Solar power generation solar panel,True,True
2,階段 昇降機,Staircase elevator,True,True
3,老人 ホーム,Nursing home,True,True
4,型 落ち デスクトップ パソコン,Dropped desktop PC,False,False
5,ローレックス 買い取り 価格,Lawx purchase price,False,False
6,zoom アプリ 無料,ZOOM app free,True,True
7,介護 付き 有料 老人 ホーム,Paid nursing home with nursing care,True,True
8,スマホ で 監視 防犯 カメラ,Surveillance and security camera with a smartphone,True,True
9,ローレックス 買取 相場,Lawx purchase price,False,False


## Summary ##
### Japanese → English ###
In general, the translations were acceptable and if they retained their topic they also retained their user intent.  

**72.2%** of inputs **retained** their topic and user intent.  
**27.8%** of inputs **did not retain** topic or user intent.  

For the **27.8%** of inputs that **did not retain their topic or user intent**, there are 3 reasons why.  

**40%** were affected by **polysemy**, the many possible meanings for a word or phrase.  
**40%** were mistranslated **katakana words.**  
**20%** were mistranslated due to **spacing.**  
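
The arithmetic behind these percentages can be checked with a short sketch. It assumes 18 Japanese inputs (the rows selected by `iloc[:18]`) and uses the failing indexes and causes named in the sections and notes of this summary.

```python
import pandas as pd

# Number of Japanese inputs evaluated (rows 0-17).
TOTAL_JA_INPUTS = 18

# Index of each failed translation mapped to the cause assigned in this summary.
failure_cause = pd.Series(
    {4: "spacing", 5: "katakana", 9: "katakana", 12: "polysemy", 17: "polysemy"}
)

# Share of all inputs that lost topic or user intent, as a percentage.
failed_pct = round(len(failure_cause) / TOTAL_JA_INPUTS * 100, 1)
retained_pct = round(100 - failed_pct, 1)

# Share of the failures attributed to each cause.
cause_share = failure_cause.value_counts(normalize=True)

print(retained_pct, failed_pct)
print(cause_share)
```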

### 1. Polysemy ###  
At **index 17**, translating 'hp 法人 向け' (eichipi- houjinmuke) to 'For HP corporation' introduces ambiguity not present in the Japanese input. The original input expresses the user intent of somebody seeking to purchase HP laptops in bulk at a reduced price per unit as part of a B2B sale. This translation captures the 'for' from '向け' (muke), while '法人' (houjin) blends with 'HP' to form 'HP corporation.' Spacing within the compound word '法人向け' likely exacerbates this problem.  

At **index 12**, translating '募集 求人' to 'Recruitment' condenses both Japanese words into one English word. Semantic aspects of both words include the idea of recruitment, but the translated query 'Recruitment' does not capture the intent of looking for job postings. This was likely unaffected by spacing. Though these words are used in close proximity to each other, '募集求人' is not a compound word.  

### 2. Katakana mistranslations ###
At **indexes 5 and 9**, 'ローレックス' (ro-rekkusu) is mistranslated as 'Lawx.' The correct translation should be Rolex, as in the luxury watch brand. 

Katakana is a Japanese writing system that is mostly used for foreign words. Katakana characters represent a syllable composed of a consonant and vowel pair. The mistranslation at **indexes 5 and 9** is unusual because the derivation of the character 'x' in 'Lawx' suggests the algorithm is breaking the katakana characters into smaller phonemes and excluding the /r/ sound to create 'Lawx.' This is a somewhat sophisticated mistake. In the case of an insufficient corpus such that 'ローレックス' would not be translated to 'Rolex', a more expected mistranslation would be something like 'lawlex' or 'lawrex.' 

### 3. Spacing ###
There are half-width spaces between and within words in every query on this list. These spaces can cause issues with machine translation.  

Seeing half-width spaces suggests that these are keywords suggested to a user by Google, rather than ones input entirely by the user. This is because Japanese uses full-width spaces, and only between sentences or clauses. It's been [documented](https://www.humblebunny.com/multilingual-seo-japanese-research/) that when Google suggests keywords it inserts these half-width spaces between *words.* A suggested keyword is functionally identical to the same query typed without half-width spaces, but the insertion of these spaces often breaks up compound words and could lead to mistranslations.  

At **index 4** for example, the suggested query [型 落ち デスクトップ パソコン](https://www.google.com/search?q=%E5%9E%8B+%E8%90%BD%E3%81%A1+%E3%83%87%E3%82%B9%E3%82%AF%E3%83%88%E3%83%83%E3%83%97+%E3%83%91%E3%82%BD%E3%82%B3%E3%83%B3&rlz=1C1FKPE_enUS965US965&sxsrf=ALiCzsbe6s4Vuwf1QzbUuwWCvL0Msp1Xjw%3A1656985137306&ei=MZbDYuGrEsTa9AOM_YXACg&ved=0ahUKEwihiIfUzuD4AhVELX0KHYx-AagQ4dUDCA8&uact=5&oq=%E5%9E%8B+%E8%90%BD%E3%81%A1+%E3%83%87%E3%82%B9%E3%82%AF%E3%83%88%E3%83%83%E3%83%97+%E3%83%91%E3%82%BD%E3%82%B3%E3%83%B3&gs_lcp=Cgdnd3Mtd2l6EAMyBAgjECc6BwgjEOoCECdKBAhBGABKBAhGGABQqQZYqQZgigloAXABeACAAT-IAT-SAQExmAEAoAEBoAECsAEKwAEB&sclient=gws-wiz) would return the same SERP as [型落ちデスクトップパソコン](https://www.google.com/search?q=%E5%9E%8B%E8%90%BD%E3%81%A1%E3%83%87%E3%82%B9%E3%82%AF%E3%83%88%E3%83%83%E3%83%97%E3%83%91%E3%82%BD%E3%82%B3%E3%83%B3&rlz=1C1FKPE_enUS965US965&sxsrf=ALiCzsYwss3acSViC2BP5n6GfJgytUZl0w%3A1656985726306&ei=fpjDYuehEvKG0PEPqsKU0AQ&ved=0ahUKEwjn2PTs0OD4AhVyAzQIHSohBUoQ4dUDCA8&uact=5&oq=%E5%9E%8B%E8%90%BD%E3%81%A1%E3%83%87%E3%82%B9%E3%82%AF%E3%83%88%E3%83%83%E3%83%97%E3%83%91%E3%82%BD%E3%82%B3%E3%83%B3&gs_lcp=Cgdnd3Mtd2l6EAMyBQgAEIAEOgcIIxDqAhAnSgQIQRgASgQIRhgAUNQBWNQBYOYJaAFwAXgAgAE8iAE8kgEBMZgBAKABAaABArABCsABAQ&sclient=gws-wiz), but in this case the suggested keyword was mistranslated due to the insertion of spaces.  

## Conclusions ##
1. Polysemic mistranslations comprised **40%** of mistranslations and [could point to a low quality corpus in the MT system used.](https://sienstranslation.com/en/blog/2020/12/23/most-frequent-errors-in-machine-translation/)   
2. Katakana mistranslations also comprised **40%** of mistranslations, although both examples were of the same word mistranslated the same way. The deconstruction of katakana syllables into smaller phonemes was sophisticated in a way, but I would not expect an MT system trained on sufficient corpora to fail to translate 'Rolex.'  
3. Spacing affected all the queries, but only **20%** were mistranslated *wholly* because of the insertion of half-width spaces. I expect that mistranslation due to spacing would be mitigated to an extent by the ideographic meaning of Kanji, the system of Japanese writing using Chinese ideographic characters. 

If this data were representative of a larger data set, nearly 30% would be mistranslated. **This would make for a poor user experience and risks user retention.** 

Mistranslations could be reduced by:
1. Improving or implementing [word sense disambiguation](http://www.scholarpedia.org/article/Word_sense_disambiguation) to reduce mistranslations due to polysemy. 
2. Accounting for how suggested keywords differ from user generated keywords in regard to spacing. 
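
As an illustration of the second point, one possible preprocessing step is to remove the spaces that sit between Japanese characters before a query reaches the MT system. This is only a sketch: the function name `collapse_ja_spaces` and the character ranges are assumptions, and joining every space-separated keyword may not suit every query, so it would need to be evaluated against real translations.

```python
import re

# Hiragana, katakana (including the long-vowel mark), and common kanji.
# Assumption: these ranges are enough for the queries in this data set.
_JA_CHAR = r"[\u3040-\u30ff\u4e00-\u9fff]"

# Half-width or full-width spaces with a Japanese character on both sides.
_SPACE_BETWEEN_JA = re.compile(rf"(?<={_JA_CHAR})[ \u3000]+(?={_JA_CHAR})")


def collapse_ja_spaces(query: str) -> str:
    """Remove spaces between Japanese characters, keeping spaces next to Latin text."""
    return _SPACE_BETWEEN_JA.sub("", query)


print(collapse_ja_spaces("型 落ち デスクトップ パソコン"))
print(collapse_ja_spaces("zoom アプリ 無料"))
```

Spaces next to Latin tokens such as 'zoom' or 'hp' are left alone so that brand names stay separated from the Japanese text.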

____________________________________
# Part 3 continued #

## English text translated to Japanese  
* Are there grammatical errors? Would the Japanese translations seem awkward or otherwise not
    make sense to a Japanese reader?

In [138]:
part3_df = translations_df.filter(items = ['input', 'translation'])
part3_df.iloc[18:]

Unnamed: 0,input,translation
18,Tips for Buying a Beach Home,ビーチの家を買うためのヒント
19,The Best Cookware for Every Cook,すべての料理人に最適な調理器具
20,How Often Should You Wash Your Bed Sheets?,どのくらいの頻度でベッドシートを洗う必要がありますか？
21,4 Reasons to Start Investing in Travel and Hospitality,旅行とホスピタリティへの投資を開始する4つの理由
22,5 Funny Quotes from History,歴史からの5つの面白い引用
23,"Are you a frequent traveler? Do you head out of town every once in a blue moon? No matter how often you travel, quality luggage is a good investment. You need to be able to rely on your luggage to keep your belongings protected when you’re on the move. Broken zippers and seams are a pain no matter how often (or not) you’re in the airport.",あなたは頻繁に旅行しますか？ あなたは青い月に一度町を出ますか？ あなたがどれほど頻繁に旅行しても、高品質の荷物は良い投資です。 移動中に荷物を保護するために、荷物に頼ることができる必要があります。 空港にどれだけ頻繁に（またはそうでなくても）、壊れたジッパーと縫い目は苦痛です。
24,"The only thing that should hold you back from dining on these delicious sea critters is an allergy to shellfish. Otherwise, you might want to consider taking a weekend off to make a foodie trip to Australia. The king oysters found in the south-central Coffin Bay are huge, succulent and intensely flavorful. If you’d like to make a day or an evening of it, you can even join one of the many Coffin Bay oyster tours and dine on some of the freshest, tastiest seafood in the world. Oh, and wine is often included.",これらのおいしい海の生き物で食事をするのを妨げる唯一のことは、貝に対するアレルギーです。 それ以外の場合は、オーストラリアへの食通旅行をするために週末を休むことを検討することをお勧めします。 コフィンベイの中央南部で見つかったキングオイスターは、巨大でジューシーで、非常に風味豊かです。 一日または夜を過ごしたい場合は、多くのコフィンベイオイスターツアーの1つに参加して、世界で最も新鮮でおいしいシーフードのいくつかで食事をすることもできます。 ああ、そしてワインがしばしば含まれています。


## English → Japanese Summary ##
I am not a native Japanese speaker, but there are no glaring grammatical errors that I can see in any of the translated Japanese. I believe that a Japanese reader would understand all of the translated article titles. However, some of the article text is awkward and may cause slight confusion.  

### AMG Network Article Titles ###
The translated article titles are all grammatically correct and retain their meaning. There are some phrases that might be optimized for better search visibility.
* For instance, at **index 18**, 'ビーチの家' (bi-chi no ie), meaning beach house, is a broad query and returns all sorts of phrases relating to beaches and houses. Something like '海沿いの家' (umizoi no ie), meaning seaside house, returns real estate listings and articles similar to the original AMG Network article.
* At **index 20**, 'ベッドシート' (beddo shi-to) might be improved by using a more common synonym 'ベッドシーツ' (beddo shi-tsu).

### AMG Network Article Text ###
Each article suffered from unique translation issues, but the translated text is all grammatically correct.  

The translated article text at **index 23** suffers mostly from **literal translation.** It also sounds a bit unnatural. 
* Repetition of the second person pronoun 'あなた' (anata) in the first 3 sentences in the example at index 23 sounds unnatural. In Japanese it sounds more natural to omit the subject or topic of the sentence, only mentioning it if context would otherwise fail.  
* 'Once in a blue moon' is translated literally and the idiomatic meaning is lost in translation.  
* 'Pain' as in an annoyance is translated to '苦痛' (kutsuu) meaning anguish, which sounds bizarrely intense.  

The input article text at **index 24** contains long and wordy clauses that form complex sentences. Even so, I believe this is the better translated of the two texts. The translated text suffers from **ambiguity** created by translating certain English phrases containing 'oyster' into katakana.
* The translation of 'king oyster' to 'キングオイスター' (kingu oisuta-) never connects the katakana spelling of the English word 'oyster' to the more common Japanese word for oyster, which is 'カキ' (kaki). I think that it's very likely that a Japanese reader would infer from the context that this article is written about oysters, but it's not stated explicitly in the translation.  
* The translation of 'Oh' to 'ああ' (aa) in this article may or may not convey the fun and informational tone of the input text. I don't often see 'ああ' (aa) in writing outside of dialog or personal correspondence. It could seem strange, but I can't say for sure. 

## Conclusions ##
**The translated article titles are acceptable**. From an SEO standpoint, they could be improved by researching synonymous keywords in the target language.  

The translated article text is clearly machine translated. It sounds unnatural in places, and at times it doesn't make sense. **Were the translated article text to go live it could risk repelling a readership.**  

Implementing an editorial style guide to improve the performance of machine translation would be worth exploring. Utilizing controlled language has been instrumental in creating documentation, manuals, and various other technical information that is easily machine translatable. I wonder if this could be achieved editorially as well.  

______________
# Notes by index #
### Japanese → English ###

4). '型落ち' (kataochi) means out-of-production or an older model. This phrase was likely mistranslated due to spacing in the original query '型 落ち'. '型落ち' (kataochi) is formed from the word '型' (kata) meaning type, form, or standard and '落ち' (ochi), the noun stem of the verb '落ちる' (ochiru), which means to fall or to drop.  


5). 'ローレックス' (ro-rekkusu) is meant to represent the luxury watch brand Rolex.  


9). See Note for index 5.


11). The queries at index 7 '介護 付き 有料 老人 ホーム' and index 11 '介護 付 有料 老人 ホーム' are identical except for the omission of the hiragana character 'き' (ki) in query 11. The machine translation is still accurate despite this due to the ideographic meaning of the character '付', which comes from the verb '付く' (tsuku) in this case.  


12). The two words in this query are '募集' (boshuu) and '求人' (kyuujin). Both words refer to 'recruitment' of some kind. '募集' (boshuu) means recruitment by means of advertisement or solicitation, while '求人' (kyuujin) is a job listing. The translated query would return general informational content about recruitment, but not necessarily job listings.  


14). Translating the query '探偵 浮気 調査' (tantei uwaki chousa) to 'detective cheating survey' retains the topic. However, the user intent is ambiguous as 'survey' can be interpreted as a data acquisition method. It can also mean 'survey' as in an examination or investigation. The English keyword 'survey' returns articles about extramarital detective work _citing_ surveys higher on the SERP, where 'investigation' might return pricing information about extramarital detective services. This semantic ambiguity exists in the Japanese word '調査' (chousa) as well.  


17). In the query 'hp 法人 向け' (eichipi- houjinmuke), [' 法人 向け' (houjinmuke) means products or services targeted towards corporate organizations rather than individual consumers, usually with a reduced price per unit.](https://www.weblio.jp/content/%E6%B3%95%E4%BA%BA%E5%90%91%E3%81%91#:~:text=%E5%80%8B%E4%BA%BA%E3%81%A7%E3%81%AF%E3%81%AA%E3%81%8F%E6%B3%95%E4%BA%BA%E7%B5%84%E7%B9%94,%E3%81%A6%E3%81%84%E3%82%8B%E3%81%93%E3%81%A8%E3%81%8C%E5%A4%9A%E3%81%84%E3%80%82)  

### English → Japanese ###

18). 'ビーチの家' (bi-chi no ie) is a suitable translation for 'beach house'. However, a synonym like '海沿いの家' (umizoi no ie) could improve the precision of results, as 'ビーチ' (bi-chi) is matching for articles containing tips for a beach lifestyle without much mention of real estate. 


19). and 20). Both return relevant results. The use of the polite register in translation 20 seems conversational, and quite different from other Japanese queries I've seen while working on this assignment.  



21). Though grammatically correct, this translated query didn't return precise results via the first SERP on Ask, Google, or Yahoo Japan. Precision was highest using Yahoo.com. 

23). The repetition of 'あなた' in the first three sentences sounds strange, as the syntactic subject or topic is often omitted. The idiomatic expression 'once in a blue moon' translates literally and won't make sense. 'Pain' gets translated literally as well and sounds intense, where 'inconvenient' may fit the tone better.

24). Though it is 5 sentences long, which is as many as the previous text has, it has longer and more complex sentences. Despite this, I believe it is the better translated of the two. 




 ________________________________________________ 
## Additional resource gathering ##
I met with Sasha on 6/23/22 after I had some time with the data. I asked for guidance on terminology that I was confused about, specifically the terms search intent and user intent. I also received guidance on how granular I might make the low end of my scoring rubric for part 2 of the assignment.  

I met with Sasha again on 6/30/22 and I was advised to think about how low quality matches or translations could affect the business, as well as to think about bucketing groups into larger trends.

I met one last time with Sasha on 7/7/22. I briefly ran her through the notebook and she suggested that I add some of our points of discussion to my summary in part 3.
